Conference Paper

Concept Detection in Multimedia Web Resources About Home Made Explosives

Abstract

This work investigates the effectiveness of a state-of-the-art concept detection framework for automatically classifying multimedia content, namely images and videos, embedded in publicly available Web resources containing recipes for the synthesis of Home Made Explosives (HMEs), into a set of predefined semantic concepts relevant to the HME domain. The concept detection framework employs advanced methods for video (shot) segmentation, visual feature extraction (using SIFT, SURF, and their variations), and classification based on machine learning techniques (logistic regression). The evaluation experiments are performed using an annotated collection of multimedia HME content discovered on the Web and a set of concepts that emerged from an empirical study and were also provided by domain experts and interested stakeholders, including Law Enforcement Agency personnel. The experiments demonstrate the satisfactory performance of our framework, which in turn indicates the significant potential of the adopted approaches in the HME domain.
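The classification stage named in the abstract (logistic regression over visual feature vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional feature vectors, the learning rate, the epoch count, and all function names below are assumptions chosen for illustration, and a real system would operate on high-dimensional encodings of SIFT/SURF descriptors.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def concept_score(x, weights, bias):
    """Probability-like score that the concept is present in the item."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Toy, made-up feature vectors: label 1 means "concept present".
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)
```

One such detector would typically be trained per concept, with the score used to rank or threshold images and video shots.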

... In order to mitigate the widespread usage of the Internet for such malevolent intentions, authorities are deploying sophisticated systems that automatically analyse content on the Web and on social media [18]. Thus, several counterterrorism systems already exist for analysing text [8], multimedia [17], as well as social media content [4,13,14] on the Web. ...
Conference Paper
The Web and social media nowadays play an increasingly significant role in spreading terrorism-related propaganda and content. In order to deploy counterterrorism measures, authorities rely on automated systems for analysing text, multimedia, and social media content on the Web. However, since each of these systems is an isolated solution, investigators often face the challenge of having to cope with a diverse array of heterogeneous sources and formats that generate vast volumes of data. Semantic Web technologies can alleviate this problem by delivering a toolset of mechanisms for knowledge representation, information fusion, semantic search, and sophisticated analyses of terrorist networks and spatiotemporal information. In the Semantic Web environment, ontologies play a key role by offering a shared, uniform model for semantically integrating information from multimodal heterogeneous sources. An additional benefit is that ontologies can be augmented with powerful tools for semantic enrichment and reasoning. This paper presents such a unified semantic infrastructure for information fusion of terrorism-related content and threat detection on the Web. The framework is deployed within the TENSOR EU-funded project, and consists of an ontology and an adaptable semantic reasoning mechanism. We strongly believe that, in the short- and long-term, these techniques can greatly assist Law Enforcement Agencies in their investigational operations.
... With regard to research efforts related to discovering and analyzing HME Web content, a concept detection mechanism has been developed in the context of the HOMER project with the goal of automatically identifying the relevance of already discovered multimedia files (videos/images) to the HME domain [11]; however, this work has solely addressed the identification of HME-related objects in multimedia, rather than the discovery of such content on the Web. In addition, a Knowledge Management Platform for managing the discovery, analysis, and retrieval of HME-related content has been developed [12]; nevertheless, this effort has mainly addressed issues related to the architecture of the entire framework for HME knowledge management, rather than the discovery of HME-related Web content. ...
Article
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates 11 hyperlink selection methods, among them a novel strategy based on the dynamic linear combination of a link-based classifier and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed focused crawler both for the Surface and the Dark Web.
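The idea of dynamically combining a link-based classifier with a parent-page classifier can be sketched as follows. The abstract does not give the weighting rule, so the one used here (a weight that grows with the number of text tokens near the hyperlink, saturating at a made-up limit) is purely a hypothetical heuristic, as are the function and field names.

```python
def combined_link_score(link_score, parent_score, local_tokens, saturation=10):
    """Blend two relevance scores with a weight driven by local evidence.

    `alpha` is the weight on the link-based classifier. It grows linearly
    with the amount of text around the hyperlink (hypothetical rule) and
    saturates at 1.0 once `saturation` tokens are available.
    """
    alpha = min(local_tokens, saturation) / saturation
    return alpha * link_score + (1.0 - alpha) * parent_score

def rank_links(candidates):
    """Order candidate hyperlinks from most to least promising."""
    return sorted(
        candidates,
        key=lambda c: combined_link_score(
            c["link_score"], c["parent_score"], c["tokens"]
        ),
        reverse=True,
    )

# Made-up candidates: the first has rich local text, the second has none.
links = [
    {"url": "a", "link_score": 0.2, "parent_score": 0.9, "tokens": 10},
    {"url": "b", "link_score": 0.2, "parent_score": 0.9, "tokens": 0},
]
ranked = rank_links(links)
```

With no local text the decision falls back entirely on the parent page's score; with ample local text the link-based classifier dominates.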
Conference Paper
This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly traverse the Surface Web and several darknets present in the Dark Web (i.e. Tor, I2P and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the network type. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed approach both for the Surface and the Dark Web.
Conference Paper
This work investigates the effectiveness of a novel interactive search engine in the context of discovering and retrieving Web resources containing recipes for synthesizing Home Made Explosives (HMEs). The discovery of HME Web resources on both the Surface and the Dark Web is addressed as a domain-specific search problem; the architecture of the search engine is based on a hybrid infrastructure that combines two different approaches: (i) a Web crawler focused on the HME domain; (ii) the submission of HME domain-specific queries to general-purpose search engines. Both approaches are accompanied by a user-initiated post-processing classification for reducing the potential noise in the discovery results. The design of the application is based on the distinctive nature of law enforcement agency user requirements, which dictate the interactive discovery and the accurate filtering of Web resources containing HME recipes. The experiments evaluating the effectiveness of our application demonstrate its satisfactory performance, which in turn indicates the significant potential of the adopted approaches in the HME domain.
Article
Designing a methodology for effective scene understanding systems is one of the main goals of researchers in video surveillance analysis. The objects in the scene have to be identified; hence, it is necessary to detect the parts belonging to the background. In this article we introduce the base algorithms that enable us to realize the scenarios. We briefly describe the base algorithms (object detection, object localization, recognition of humans, movement detection, and scene configuration) used in three selected scenarios: violation of protected zones, abandoned objects, and vandalism (graffiti). These scenarios were tested on several films obtained from the Internet and made by participants of the SIMPOZ project. The results of our experiments are presented. The basic algorithms for detecting and locating objects are very fast, but the movement detection ("optical flow") and human recognition algorithms take longer.
Conference Paper
A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLAD difference coding, even with ℓ2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The result is an 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge.
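The building block behind fast box evaluation is the integral image (summed-area table), which gives the sum over any axis-aligned box in constant time. The sketch below shows only this basic mechanism on a small dense grid; FLAIR itself keeps one sparse integral image per codeword, which is not reproduced here, and the function names are illustrative.

```python
def integral_image(grid):
    """Build a summed-area table with an extra zero row and column.

    table[r][c] holds the sum of grid[0:r][0:c].
    """
    rows, cols = len(grid), len(grid[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += grid[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table

def box_sum(table, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in O(1)."""
    return (
        table[bottom][right]
        - table[top][right]
        - table[bottom][left]
        + table[top][left]
    )

# Toy 3x3 grid of per-pixel counts for a single (hypothetical) codeword.
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
table = integral_image(grid)
```

Once the table is built, evaluating thousands of candidate boxes costs four lookups each, independent of box size.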
Conference Paper
This paper introduces an algorithm for fast temporal segmentation of videos into shots. The proposed method detects abrupt and gradual transitions, based on the visual similarity of neighboring frames of the video. The descriptive efficiency of both local (SURF) and global (HSV histograms) descriptors is exploited for assessing frame similarity, while GPU-based processing is used for accelerating the analysis. Specifically, abrupt transitions are initially detected between successive video frames where there is a sharp change in the visual content, which is expressed by a very low similarity score. Then, the calculated scores are further analysed for the identification of frame-sequences where a progressive change of the visual content takes place and, in this way, gradual transitions are detected. Finally, a post-processing step is performed aiming to identify outliers due to object/camera movement and flash-lights. The experiments show that the proposed algorithm achieves high accuracy while being capable of faster-than-real-time analysis.
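The abrupt-transition step described above can be sketched with global histograms alone. This is a simplified illustration, not the paper's algorithm: it ignores the SURF-based local similarity, gradual transitions, and the post-processing step, and the histogram-intersection measure, the 0.5 threshold, and the toy histograms are all assumptions.

```python
def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def detect_abrupt_transitions(histograms, threshold=0.5):
    """Return each index i where a cut is declared between frames i-1 and i.

    A cut is declared when the similarity of successive frames drops
    below `threshold` (an assumed value, to be tuned on real data).
    """
    cuts = []
    for i in range(1, len(histograms)):
        if histogram_intersection(histograms[i - 1], histograms[i]) < threshold:
            cuts.append(i)
    return cuts

# Made-up 3-bin colour histograms for four frames; the content changes
# sharply between the second and third frame.
frames = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.10, 0.10, 0.80],
    [0.12, 0.10, 0.78],
]
cuts = detect_abrupt_transitions(frames)
```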
Article
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors (software to compute the color descriptors from this paper is available from http://www.colordescriptors.com) in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a data set with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results further reveal that, for light intensity shifts, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the data set and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8 percent on the PASCAL VOC 2007 and by 7 percent on the Mediamill Challenge.
Conference Paper
Shot segmentation provides the basis for almost all high-level video content analysis approaches, validating it as one of the major prerequisites for efficient video semantic analysis, indexing and retrieval. The successful detection of both gradual and abrupt transitions is necessary to this end. In this paper a new gradual transition detection algorithm is proposed that is based on novel criteria, such as color coherence change, that exhibit less sensitivity to local or global motion than previously proposed ones. These criteria, each of which could serve as a standalone gradual transition detection approach, are then combined using a machine learning technique to produce a meta-segmentation scheme. Besides significantly improved performance, an advantage of the proposed scheme is that there is no need for threshold selection, as opposed to what would be the case if any of the proposed features were used by themselves and as is typically the case in the relevant literature. Performance evaluation and comparison with four other popular algorithms reveal the effectiveness of the proposed technique.
Conference Paper
Intelligent tasks, such as visual perception, auditory perception, and language understanding, require the construction of good internal representations of the world (or "features"), which must be invariant to irrelevant variations of the input while preserving relevant information. A major question for Machine Learning is how to learn such good features automatically. Convolutional Networks (ConvNets) are a biologically-inspired trainable architecture that can learn invariant features. Each stage in a ConvNet is composed of a filter bank, some nonlinearities, and feature pooling layers. With multiple stages, a ConvNet can learn multi-level hierarchies of features. While ConvNets have been successfully deployed in many commercial applications from OCR to video surveillance, they require large amounts of labeled training samples. We describe new unsupervised learning algorithms, and new non-linear stages that allow ConvNets to be trained with very few labeled samples. Applications to visual object recognition and vision navigation for off-road mobile robots are described.
Conference Paper
The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
Conference Paper
Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.
Article
In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives which are in majority concept-based. Therefore, central to our discussion is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human–computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities.
Article
This paper deals with the problem of image retrieval from large image databases. A particularly interesting problem is the retrieval of all images which are similar to one in the user's mind, taking into account his/her feedback which is expressed as positive or negative preferences for the images that the system progressively shows during the search. Here we present a novel algorithm for the incorporation of user preferences in an image retrieval system based exclusively on the visual content of the image, which is stored as a vector of low-level features. The algorithm considers the probability of an image belonging to the set of those sought by the user, and models the logit of this probability as the output of a generalized linear model whose inputs are the low-level image features. The image database is ranked by the output of the model and shown to the user, who selects a few positive and negative samples, repeating the process in an iterative way until he/she is satisfied. The problem of the small sample size with respect to the number of features is solved by adjusting several partial generalized linear models and combining their relevance probabilities by means of an ordered averaged weighted operator. Experiments were made with 40 users and they exhibited good performance in finding a target image (4 iterations on average) in a database of about 4700 images. The mean numbers of positive and negative examples are 4 and 6 per iteration, respectively. A clustering of users into sets also shows consistent patterns of behavior.
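The step that combines the relevance probabilities of several partial models can be illustrated with a standard ordered weighted averaging (OWA) operator, assuming that this is what the abstract's "ordered averaged weighted operator" refers to. The weight vectors and probabilities below are made up; the abstract does not state which weights the authors use.

```python
def owa(values, weights):
    """Ordered weighted averaging.

    The weights are applied to the values after sorting them in
    descending order, so a weight is attached to a rank position
    rather than to a particular partial model.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

# Made-up relevance probabilities from three partial models for one image.
partial_probs = [0.2, 0.9, 0.5]

optimistic = owa(partial_probs, [1.0, 0.0, 0.0])   # behaves like max
pessimistic = owa(partial_probs, [0.0, 0.0, 1.0])  # behaves like min
balanced = owa(partial_probs, [1 / 3, 1 / 3, 1 / 3])  # plain mean
```

Choosing the weight vector moves the combination smoothly between the maximum and the minimum of the partial probabilities.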
Article
This paper addresses the problem of large-scale image search. Three constraints have to be taken into account: search accuracy, efficiency, and memory usage. We first present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual words approach for any given vector dimension. We then jointly optimize dimensionality reduction and indexing in order to obtain a precise vector comparison as well as a compact representation. The evaluation shows that the image representation can be reduced to a few dozen bytes while preserving high accuracy. Searching a 100 million image dataset takes about 250 ms on one processor core.
Article
The TREC Video Retrieval Evaluation (TRECVid) is an international benchmarking activity to encourage research in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. TRECVid completed its fifth annual cycle at the end of 2005 and in 2006 TRECVid will involve almost 70 research organizations, universities and other consortia. Throughout its existence, TRECVid has benchmarked both interactive and automatic/manual searching for shots from within a video corpus, automatic detection of a variety of semantic and low-level video features, shot boundary detection and the detection of story boundaries in broadcast TV news. This paper will give an introduction to information retrieval (IR) evaluation from both a user and a system perspective, highlighting that system evaluation is by far the most prevalent type of evaluation carried out. We also include a summary of TRECVid as an example of a system evaluation benchmarking campaign and this allows us to discuss whether such campaigns are a good thing or a bad thing. There are arguments for and against these campaigns and we present some of them in the paper concluding that on balance they have had a very positive impact on research progress.
Article
This article presents a new system for automatically extracting high-level video concepts. The novelty of the approach lies in the feature fusion method. The system architecture is divided into three steps. The first step consists in creating sensors from a low-level (color or texture) descriptor, and a Support Vector Machine (SVM) learning to recognize a given concept (for example, “beach” or “road”). The sensor fusion step is the combination of several sensors for each concept. Finally, as the concepts depend on context, the concept fusion step models interaction between concepts in order to modify their prediction. The fusion method is based on the Transferable Belief Model (TBM). It offers an appropriate framework for modeling source uncertainty and interaction between concepts. Results obtained on TREC video protocol demonstrate the improvement provided by such a combination, compared to mono-source information.
Conference Paper
In this paper we address the problem of sports video classification using hidden Markov models (HMMs). For each sports genre, we construct two HMMs representing motion and color features respectively. The observation sequences generated from the principal motion direction and the principal color of each frame are fed to a motion and a color HMM respectively. The outputs are integrated to make a final decision. We tested our scheme on 220 minutes of sports video with four genre types: ice hockey, basketball, football, and soccer, and achieved an overall classification accuracy of 93%.
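The classification scheme can be sketched with discrete HMMs scored by the forward algorithm. The abstract does not say how the motion and colour outputs are integrated, so summing the two log-likelihoods is an assumption made here, and the two-state models, the binary observation symbols, and the genre names are toy values rather than anything from the paper.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a non-empty observation sequence under a discrete HMM.

    Uses the forward algorithm with per-step rescaling to avoid underflow.
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

def classify(obs_motion, obs_color, models):
    """Pick the genre whose motion and colour HMMs jointly fit best.

    Integration by summing log-likelihoods is an assumption.
    """
    def score(genre):
        motion_hmm, color_hmm = models[genre]
        return (forward_log_likelihood(obs_motion, *motion_hmm)
                + forward_log_likelihood(obs_color, *color_hmm))
    return max(models, key=score)

# Toy HMMs as (start, transition, emission) over symbols {0, 1}.
hmm_zero = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]])
hmm_one = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]])
models = {"genre_a": (hmm_zero, hmm_zero), "genre_b": (hmm_one, hmm_one)}
```

In this toy setup a clip whose quantized motion and colour symbols are mostly 0 is assigned to `genre_a`, and one dominated by 1 to `genre_b`.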
Article
The aim of salient feature detection is to find distinctive local events in images. Salient features are generally determined from the local differential structure of images. They focus on the shape-saliency of the local neighborhood. The majority of these detectors are luminance-based, which has the disadvantage that the distinctiveness of the local color information is completely ignored in determining salient image features. To fully exploit the possibilities of salient point detection in color images, color distinctiveness should be taken into account in addition to shape distinctiveness. In this paper, color distinctiveness is explicitly incorporated into the design of saliency detection. The algorithm, called color saliency boosting, is based on an analysis of the statistics of color image derivatives. Color saliency boosting is designed as a generic method easily adaptable to existing feature detectors. Results show that substantial improvements in information content are acquired by targeting color salient features.
Article
In this paper, we deal with the problem of extending and using different local descriptors, as well as exploiting concept correlations, toward improved video semantic concept detection. We examine how the state-of-the-art binary local descriptors can facilitate concept detection, we propose color extensions of them inspired by previously proposed color extensions of scale invariant feature transform, and we show that the latter color extension paradigm is generally applicable to both binary and nonbinary local descriptors. In order to use them in conjunction with a state-of-the-art feature encoding, we compact the above color extensions using PCA and we compare two alternatives for doing this. Concerning the learning stage of concept detection, we perform a comparative study and propose an improved way of employing stacked models, which capture concept correlations, using multilabel classification algorithms in the last layer of the stack. We examine and compare the effectiveness of the above algorithms in both semantic video indexing within a large video collection and in the somewhat different problem of individual video annotation with semantic concepts, on the extensive video data set of the 2013 TRECVID Semantic Indexing Task. Several conclusions are drawn from these experiments on how to improve the video semantic concept detection.
Conference Paper
In this work we deal with the problem of how different local descriptors can be extended, used and combined for improving the effectiveness of video concept detection. The main contributions of this work are: 1) We examine how effectively a binary local descriptor, namely ORB, which was originally proposed for similarity matching between local image patches, can be used in the task of video concept detection. 2) Based on a previously proposed paradigm for introducing color extensions of SIFT, we define in the same way color extensions for two other non-binary or binary local descriptors (SURF, ORB), and we experimentally show that this is a generally applicable paradigm. 3) In order to enable the efficient use and combination of these color extensions within a state-of-the-art concept detection methodology (VLAD), we study and compare two possible approaches for reducing the color descriptor’s dimensionality using PCA. We evaluate the proposed techniques on the dataset of the 2013 Semantic Indexing Task of TRECVID.
Article
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
Conference Paper
We address the problem of attribute-based people search in real surveillance environments. The system we developed is capable of answering user queries such as "show me all people with a beard and sunglasses, wearing a white hat and a patterned blue shirt, from all metro cameras in the downtown area, from 2pm to 4pm last Saturday". In this paper, we describe the lessons we learned from practical deployments of our system, and how we made our algorithms achieve the accuracy and efficiency required by many police departments around the world. In particular, we show that a novel set of multimodal integral filters and proper normalization of attribute scores are critical to obtain good performance. We conduct a comprehensive experimental analysis on video footage captured from a large set of surveillance cameras monitoring metro chokepoints, in both crowded and normal activity periods. Moreover, we show impressive results using images from the recent Boston marathon bombing event, where our system can rapidly retrieve the two suspects based on their attributes from a database containing more than one thousand people present at the event.
Article
Video shot boundary detection is a fundamental step for content-based video analysis and has been widely studied in recent years. In this paper, a novel shot boundary detection algorithm is proposed. The algorithm uses feature space kernel smoothing to segment video into shots. The method is demonstrated to have high accuracy in both cut and gradual transition detection. For flashlights and fast background changes, which cause false detections, a post-refinement strategy using local feature analysis is also studied in this paper. The effectiveness of the method is also verified by experiments.
Article
Automatic clustering of video shots is an important issue in video abstraction, browsing and retrieval. Most of the existing shot clustering algorithms need some prior domain knowledge or thresholds to obtain good clustering results, and they also have to face the difficult task of choosing proper initial cluster centers. To resolve these discommodious problems for users, this article proposes a robust unsupervised shot clustering algorithm called CAVS (Clustering Algorithm for Video Shots). In CAVS, multi-resolution analysis and Haar wavelet transformations are first applied as a dimensionality reduction approach for the high-dimensional feature vectors of shots. Then CAVS operates on the remaining subspace and merges the most similar shots into one cluster through iterative merging procedures. The iterative merging procedures are repeated until a novel stop criterion based on the theory of the Fisher Discriminant Function is satisfied, and the clustering results and the number of clusters are obtained without any parameters.
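The dimensionality-reduction idea can be illustrated with a plain Haar averaging step: each level replaces neighbouring pairs by their average and halves the vector length. This is a generic sketch of the Haar transform, not the CAVS procedure, whose details are not given in the abstract; the unnormalized averaging convention and the function names are assumptions.

```python
def haar_step(vector):
    """One level of the unnormalized Haar transform.

    Returns (averages, details), each half the length of the input.
    """
    if len(vector) % 2 != 0:
        raise ValueError("vector length must be even")
    averages = [(vector[i] + vector[i + 1]) / 2 for i in range(0, len(vector), 2)]
    details = [(vector[i] - vector[i + 1]) / 2 for i in range(0, len(vector), 2)]
    return averages, details

def haar_reduce(vector, levels):
    """Keep only the coarse averages after `levels` Haar steps."""
    for _ in range(levels):
        vector, _details = haar_step(vector)
    return vector

# Made-up 8-dimensional shot feature vector reduced to 2 dimensions.
shot_feature = [4, 2, 6, 8, 1, 3, 5, 7]
reduced = haar_reduce(shot_feature, 2)
```

Discarding the detail coefficients keeps the coarse shape of the feature vector, which is what a clustering stage would then compare.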
Article
Many video programs have story structures that can be recognized through the clustering of video contents based on low-level visual primitives and the analysis of high-level structures imposed by temporal arrangement of composing elements. In this paper we propose techniques and formulations to match and cluster video shots of similar visual contents, taking into account the visual characteristics and temporal dynamics of video. In addition, we extend the Scene Transition Graph representation for the analysis of temporal structures extracted from video. The analyses lead to automatic segmentation of scenes and story units that cannot be achieved with existing shot boundary detection schemes and the building of a compact representation of video contents. Furthermore, the segmentation can be performed on a much reduced data set extracted from compressed video and works well on a wide variety of video programming types. Hence, we are able to decompose video into meaningful hierarchies and compact representations that reflect the flow of the story. This offers a means for the efficient browsing and organization of video.
Article
In this paper, we present a method to represent achromatic and chromatic image signals independently for content-based image indexing and retrieval for image database applications. Starting from an opponent colour representation, human colour vision theories and modern digital signal processing technologies are applied to develop a compact and computationally efficient visual appearance model for coloured image patterns. We use the model to compute the statistics of achromatic and chromatic spatial patterns of colour images for indexing and content-based retrieval. Two types of colour image databases, one colour texture database and one photographic colour image database, are used to evaluate the performance of the developed method in content-based image indexing and retrieval. Experimental results are presented to show that the new method is superior to or competitive with state-of-the-art content-based image indexing and retrieval techniques.
Conference Paper
In this paper we investigate the problem of automatically identifying the genre of TV programmes. The approach here proposed is based on two foundations: Gaussian mixture models (GMMs) and artificial neural networks (ANNs). Firstly, we use Gaussian mixtures to model the probability distributions of low-level audiovisual features. Secondly, we use the parameters of each mixture model as new feature vectors. Finally, we train a multilayer perceptron (MLP), using GMM parameters as input data, to identify seven television programme genres. We evaluated the effectiveness of the proposed approach testing our system on a large set of data, summing up to more than 100 hours of broadcasted programmes.
Article
Modern large retrieval environments tend to overwhelm their users by their large output. Since all documents are not of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.
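The three measures described in this abstract can be illustrated with a minimal Python sketch. The function names are hypothetical, and the logarithmic discount (base 2, applied only from rank `base` onward) is an assumption; the abstract itself says only that a discount factor is applied. The ideal ranking is approximated here by sorting the supplied gains unless a separate `ideal_gains` list is given.

```python
import math

def cumulated_gain(gains):
    """Running sum of graded relevance scores along the ranked list."""
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def discounted_cumulated_gain(gains, base=2):
    """Like cumulated_gain, but gains at rank >= base are divided by log_base(rank).

    The log discount and its base are assumptions made for illustration.
    """
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < base else g / math.log(rank, base)
        out.append(total)
    return out

def normalized(measure, gains, ideal_gains=None):
    """Position-wise ratio of the actual measure to the same measure on an ideal ranking."""
    if ideal_gains is None:
        ideal_gains = gains  # assumption: the ideal list is a reordering of the same gains
    ideal = measure(sorted(ideal_gains, reverse=True)[:len(gains)])
    actual = measure(gains)
    return [a / i if i else 0.0 for a, i in zip(actual, ideal)]

# Graded relevance (0-3 scale) of six retrieved documents, in ranked order.
gains = [3, 2, 3, 0, 1, 2]
cg = cumulated_gain(gains)
dcg = discounted_cumulated_gain(gains)
ncg = normalized(cumulated_gain, gains)
ndcg = normalized(discounted_cumulated_gain, gains)
```

A normalized value of 1.0 at a given rank means the system did as well as the ideal ordering up to that position; lower values show how much gain was lost by ranking less relevant documents early.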
Article
We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual word representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Article
A bag-of-words model is implemented and tried on a 10-class visual concept detection problem. The experimental results show that "DURF+ERT+SVM" outperforms "SIFT+ERT+SVM" both in detection performance and computational efficiency. Besides, combining DURF and SIFT results in even better detection performance. Real-time object detection using SIFT and RANSAC is also tried on simple objects, e.g. a drink can, and good results are achieved.
Conference Paper
An algorithm for fast shot boundary detection based on SVM (Support Vector Machine) is proposed. In this algorithm, smooth intervals inside shots are first eliminated from the original video frame sequence. During this process, a gray variance sequence is extracted and serves as the basis for the detection of smooth intervals. Then a new frame sequence named RFS (Reordered Frame Sequence) is formed. Video features such as intensity pixel-wise difference, color histogram differences in HSV space and edge histogram differences in the X and Y directions are extracted from RFS. These features are presented as input vectors to the SVM to implement the cut detection. After the cut detection, temporal multi-resolution is applied to frames in RFS to perform the gradual change detection. Experimental results show that the proposed algorithm can perform shot boundary detection in real time while maintaining detection performance.
Conference Paper
With concerns about terrorism and global security on the rise, it has become vital to have in place efficient threat detection systems that can detect and recognize potentially dangerous situations, and alert the authorities to take appropriate action. Of particular significance is the case of unattended objects in mass transit areas. This paper describes a general framework that recognizes the event of someone leaving a piece of baggage unattended in forbidden areas. Our approach involves the recognition of four sub-events that characterize the activity of interest. When an unaccompanied bag is detected, the system analyzes its history to determine its most likely owner(s), where the owner is defined as the person who brought the bag into the scene before leaving it unattended. Through subsequent frames, the system keeps a lookout for the owner, whose presence in or disappearance from the scene defines the status of the bag, and decides the appropriate course of action. The system was successfully tested on the i-LIDS dataset.
Conference Paper
We propose a method for video segmentation via active learning. Shot segmentation is an essential first step to video segmentation. The color histogram-based shot boundary detection algorithm is one of the most reliable variants of histogram-based detection algorithms. It is not unreasonable to assume that the color content does not change rapidly within but across shots. Thus, we present a metric based on blocked color histogram (BCH) for inter-frame difference. Our metric is the normalized intersection of BCH between contiguous frames. Hard cuts and gradual shot transitions can be detected as valleys in the time series of the differences between color histograms of contiguous frames or of frames a certain distance apart. We try to estimate the valleys on the frame-to-frame difference curve. Each kind of shot transition (cut or gradual shot transition) has its own characteristic pattern corresponding with valleys. Therefore shot detection can be viewed as pattern recognition. We employ the support vector machine (SVM) via active learning to classify shot boundaries and non-boundaries. Our method is evaluated on the TRECVID benchmarking platform and the experimental results reveal the effectiveness and robustness of the method.
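The blocked color histogram metric described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: frames are assumed to be small grids of already-quantized color bin indices, the block and bin counts are arbitrary, and a fixed threshold stands in for the SVM with active learning that the abstract uses to classify valleys. All function names are hypothetical.

```python
def blocked_histograms(frame, blocks, bins):
    """Split a frame (rows of color-bin indices) into blocks x blocks tiles
    and return one color histogram per tile."""
    height, width = len(frame), len(frame[0])
    bh, bw = height // blocks, width // blocks
    hists = []
    for by in range(blocks):
        for bx in range(blocks):
            hist = [0] * bins
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    hist[frame[y][x]] += 1
            hists.append(hist)
    return hists

def bch_similarity(frame_a, frame_b, blocks=2, bins=4):
    """Normalized intersection of blocked color histograms, in [0, 1]."""
    ha = blocked_histograms(frame_a, blocks, bins)
    hb = blocked_histograms(frame_b, blocks, bins)
    inter = sum(min(p, q) for a, b in zip(ha, hb) for p, q in zip(a, b))
    total = sum(sum(a) for a in ha)
    return inter / total

def valleys(series, threshold=0.5):
    """Indices that are local minima below a fixed threshold.
    The threshold is a stand-in assumption for a learned classifier."""
    found = []
    for i, v in enumerate(series):
        left = series[i - 1] if i > 0 else v
        right = series[i + 1] if i + 1 < len(series) else v
        if v < threshold and v <= left and v <= right:
            found.append(i)
    return found

# Six synthetic 4x4 frames: three of color bin 0, then three of color bin 3.
dark = [[0] * 4 for _ in range(4)]
light = [[3] * 4 for _ in range(4)]
frames = [dark, dark, dark, light, light, light]
similarity = [bch_similarity(a, b) for a, b in zip(frames, frames[1:])]
cuts = valleys(similarity)
```

Index `i` in `similarity` compares frame `i` with frame `i + 1`, so a valley at index 2 marks a candidate hard cut between the third and fourth frames. Gradual transitions would produce wider, shallower valleys, which is why the abstract treats the task as pattern recognition rather than thresholding.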
A scene can be defined as one of the subdivisions of a play in which the setting is fixed, or when it presents continuous action in one place. We propose a novel two-pass algorithm for scene boundary detection, which utilizes the motion content, shot length and color properties of shots as the features. In our approach, shots are first clustered by computing Backward Shot Coherence (BSC), a shot color similarity measure that detects Potential Scene Boundaries (PSBs) in the videos. In the second pass we compute Scene Dynamics (SD), a function of shot length and the motion content in the potential scenes. In this pass, a scene merging criterion has been developed to remove weak PSBs in order to reduce over-segmentation. We also propose a method to describe the content of each scene by selecting one representative image. The segmentation of video data into a number of scenes facilitates improved browsing of videos in electronic form, such as video on demand, digital libraries, and the Internet. The proposed algorithm has been tested on a variety of videos that include five Hollywood movies, one sitcom, and one interview program, and promising results have been obtained.
Conference Paper
This paper presents the work done on video sequence interpretation. We propose a framework based on two kinds of a priori knowledge: predefined scenarios and contextual information. This approach has been applied to video sequences of the AVS-PV visual surveillance European project.
Article
Videos are composed of many shots that are caused by different camera operations, e.g., on/off operations and switching between cameras. One important goal in video analysis is to group the shots into temporal scenes, such that all the shots in a single scene are related to the same subject, which could be a particular physical setting, an ongoing action or a theme. In this paper, we present a general framework for temporal scene segmentation in various video domains. The proposed method is formulated in a statistical fashion and uses the Markov chain Monte Carlo (MCMC) technique to determine the boundaries between video scenes. In this approach, a set of arbitrary scene boundaries are initialized at random locations and are automatically updated using two types of updates: diffusion and jumps. Diffusion is the process of updating the boundaries between adjacent scenes. Jumps consist of two reversible operations: the merging of two scenes and the splitting of an existing scene. The posterior probability of the target distribution of the number of scenes and their corresponding boundary locations is computed based on the model priors and the data likelihood. The updates of the model parameters are controlled by the hypothesis ratio test in the MCMC process, and the samples are collected to generate the final scene boundaries. The major advantage of the proposed framework is two-fold: 1) it is able to find the weak boundaries as well as the strong boundaries, i.e., it does not rely on a fixed threshold; 2) it can be applied to different video domains. We have tested the proposed method on two video domains, home videos and feature films, and accurate results have been obtained.
Article
We describe an approach to generalize the concept of text-based search to nontextual information. In particular, we elaborate on the possibilities of retrieving objects or scenes in a movie with the ease, speed, and accuracy with which Google retrieves web pages containing particular words, by specifying the query as an image of the object or scene. In our approach, each frame of the video is represented by a set of viewpoint invariant region descriptors. These descriptors enable recognition to proceed successfully despite changes in viewpoint, illumination, and partial occlusion. Vector quantizing these region descriptors provides a visual analogy of a word, which we term a “visual word.” Efficient retrieval is then achieved by employing methods from statistical text retrieval, including inverted file systems, and text and document frequency weightings. The final ranking also depends on the spatial layout of the regions. Object retrieval results are reported on the full length feature films “Groundhog Day,” “Charade,” and “Pretty Woman,” including searches from within the movie and also searches specified by external images downloaded from the Internet. We discuss three research directions for the presented video retrieval approach and review some recent work addressing them: 1) building visual vocabularies for very large-scale retrieval; 2) retrieval of 3-D objects; and 3) more thorough verification and ranking using the spatial structure of objects.
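The visual-word pipeline outlined in this abstract (vector quantization, an inverted file, and frequency weighting) can be sketched roughly as follows. The toy 2-D descriptors, the three-entry vocabulary, and all function names are assumptions made for illustration; real region descriptors are high-dimensional and the vocabulary would be learned by clustering. The spatial-layout re-ranking step mentioned in the abstract is omitted.

```python
import math
from collections import Counter, defaultdict

def quantize(descriptor, vocabulary):
    """Map a descriptor to the index of its nearest vocabulary centroid."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, vocabulary[k])),
    )

def build_index(frames, vocabulary):
    """Return per-frame visual-word counts and an inverted file word -> frame ids."""
    counts = {
        fid: Counter(quantize(d, vocabulary) for d in descs)
        for fid, descs in frames.items()
    }
    inverted = defaultdict(set)
    for fid, words in counts.items():
        for w in words:
            inverted[w].add(fid)
    return counts, inverted

def tfidf(words, inverted, n_frames):
    """Term frequency times inverse document frequency for each known word."""
    total = sum(words.values())
    return {
        w: (n / total) * math.log(n_frames / len(inverted[w]))
        for w, n in words.items()
        if w in inverted
    }

def search(query_descs, counts, inverted, vocabulary):
    """Rank frames sharing at least one visual word with the query by cosine similarity."""
    n = len(counts)
    q = tfidf(Counter(quantize(d, vocabulary) for d in query_descs), inverted, n)
    candidates = set().union(*(inverted[w] for w in q)) if q else set()
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    ranked = []
    for fid in candidates:
        f = tfidf(counts[fid], inverted, n)
        f_norm = math.sqrt(sum(v * v for v in f.values()))
        dot = sum(v * f.get(w, 0.0) for w, v in q.items())
        score = dot / (q_norm * f_norm) if q_norm and f_norm else 0.0
        ranked.append((fid, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: three centroids and three frames of 2-D descriptors.
vocabulary = [(0, 0), (10, 10), (0, 10)]
frames = {
    "f1": [(0, 1), (1, 0), (9, 10)],
    "f2": [(0, 9), (1, 10), (10, 9)],
    "f3": [(10, 10), (9, 9)],
}
counts, inverted = build_index(frames, vocabulary)
results = search([(0, 0), (1, 1)], counts, inverted, vocabulary)
```

Only frames listed in the inverted file under one of the query's visual words are scored, which is what makes the lookup fast. Words that occur in every frame receive zero weight, so they contribute nothing to the ranking.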
Article
As digital libraries and video databases grow, we need methods to assist us in the synthesis and analysis of digital video. Since the information in video databases can be measured in thousands of gigabytes of uncompressed data, tools for efficient summarizing and indexing of video sequences are indispensable. In this paper, we present a method for effective classification of different types of videos that makes use of video summarization in the form of a storyboard of keyframes. To produce the summarization, we first generate a universal basis on which to project a video frame that effectively reduces any video to the same lighting conditions. Each frame is represented by a compressed chromaticity signature. We then set out a multi-stage hierarchical clustering method to efficiently summarize a video. Finally we classify TV videos using a trained hidden Markov model on the compressed chromaticity signatures and also temporal features of videos that are represented by their summaries.
Flexible, High Performance Convolutional Neural Networks for Image Classification
  • D Ciresan
  • U Meier
  • J Masci
  • L Gambardella
  • J Schmidhuber
Web Crawling. Foundations and Trends in Information Retrieval
Dark Web: Exploring and Data Mining the Dark Side of the Web
  • H Chen