Conference Paper
PDF available

Abstract and Figures

This paper provides an overview of the tasks submitted to TRECVID 2013 by ITI-CERTH. ITI-CERTH participated in the Semantic Indexing (SIN), the Event Detection in Internet Multimedia (MED), the Multimedia Event Recounting (MER) and the Instance Search (INS) tasks. In the SIN task, techniques are developed that combine new video representations (video tomographs) with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation, ensemble construction techniques, and a multi-label learning method for score refinement. In the MED task, an efficient method that uses only static visual features as well as limited audio information is evaluated. In the MER sub-task of MED, a discriminant-analysis-based feature selection method is combined with a model vector approach for selecting the key semantic entities depicted in the video that best describe the detected event. Finally, the INS task is performed by employing VERGE, an interactive retrieval application that combines retrieval functionalities in various modalities and was previously used to support the Known Item Search (KIS) task.
... We experiment with the inclusion of two Convolutional Neural Networks in our FDNN, namely AlexNet [27] and GoogLeNet [49], whose weights are learned from the large-scale ImageNet dataset. We chose AlexNet and GoogLeNet because they were the best performing models in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [42] in 2012 and 2014, respectively, and they have been widely used for image classification in recent years (e.g., in [6,12,33,63]). AlexNet consists of 5 convolutional layers (one with 11 × 11, one with 5 × 5, three with 3 × 3 filter sizes), max-pooling layers, and 3 final fully connected layers. ...
... The approach that we used to extract concepts has been presented in [33] and produces two sets of concepts. The first one is given by a pre-trained GoogLeNet [49] network, using its 1000-dimensional output as concept scores. ...
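As an illustration of this step, here is a minimal sketch, not the cited implementation, of turning a pre-trained GoogLeNet's 1000-dimensional ImageNet output into concept scores for a single keyframe. It assumes PyTorch/torchvision, and the exact `weights` argument depends on the installed torchvision version.

```python
# Hypothetical sketch: 1000-dim ImageNet concept scores for one keyframe
# from a pre-trained GoogLeNet (torchvision).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.googlenet(weights="IMAGENET1K_V1")  # pre-trained on ImageNet
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def concept_scores(keyframe_path: str) -> torch.Tensor:
    """Return a 1000-dim vector of ImageNet concept scores for one keyframe."""
    img = Image.open(keyframe_path).convert("RGB")
    x = preprocess(img).unsqueeze(0)            # shape (1, 3, 224, 224)
    with torch.no_grad():
        logits = model(x)                        # shape (1, 1000)
    return torch.softmax(logits, dim=1).squeeze(0)
```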
... International Journal of Multimedia Information Retrieval (2019) 8:19–33 ...
Article
Full-text available
Exoticism is the charm of the unfamiliar or something remote. It has received significant interest in different kinds of arts, but although visual concept classification in images and videos for semantic multimedia retrieval has been researched for years, the visual concept of exoticism has not yet been investigated from a computational perspective. In this paper, we present the first approach to automatically classify images as exotic or non-exotic. We have gathered two large datasets that cover exoticism in a general as well as a concept-specific way. The datasets have been annotated in a crowdsourcing approach. To circumvent cultural differences in the annotation, only North American crowdworkers were employed for this task. Two deep learning architectures for learning the concept of exoticism are evaluated. Besides deep learning features, we also investigate the usefulness of hand-crafted features, which are combined with deep features in our proposed fusion-based approach. Different machine learning models are compared with the fusion-based approach, which is the best performing one, reaching accuracies over 83% and 91% on two different datasets. Comprehensive experimental results provide insights into which features contribute the most to recognizing exoticism. The estimation of image exoticism could be applied in fields like advertising and travel suggestions, as well as to increase the serendipity and diversity of recommendations and search results.
... Typical concept annotation and retrieval methods use image/video exemplars as training materials to develop pre-defined concept detectors [20,21,22]. However, these methods suffer from scalability limitations because it is difficult to collect and annotate large enough datasets. ...
... For the SIN'13 dataset, we compared with the work presented in [21] and the CERTH participation in the TRECVID SIN task in 2013 [22]. For the SIN'15 dataset, we offer a comparison with the CERTH participation in the TRECVID SIN task in 2015 [20]. Table 3 shows that our approach is very competitive, even though the baselines are explicitly designed for the SIN task. ...
Preprint
Full-text available
Migration, and especially irregular migration, is a critical issue for border agencies and society in general. Migration-related situations and decisions are influenced by various factors, including the perceptions about migration routes and target countries. An improved understanding of such factors can be achieved by systematic automated analyses of media and social media channels, and the videos and images published in them. However, the multifaceted nature of migration and the variety of ways migration-related aspects are expressed in images and videos make the finding and automated analysis of migration-related multimedia content a challenging task. We propose a novel approach that effectively bridges the gap between a substantiated domain understanding - encapsulated into a set of Migration-related semantic concepts - and the expression of such concepts in a video, by introducing an advanced video analysis and retrieval method for this purpose.
... We experiment with the inclusion of two Convolutional Neural Networks in our FDNN, namely AlexNet [26] and GoogLeNet [45], whose weights are learned from the large-scale ImageNet dataset. We chose AlexNet and GoogLeNet because they were the best performing models in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [38] in 2012 and 2014, respectively, and they have been widely used for image classification in recent years (e.g. in [8,11,31,56]). AlexNet consists of 5 convolutional layers (one with 11 × 11, one with 5 × 5, three with 3 × 3 filter sizes), max-pooling layers, and 3 final fully connected layers. ...
... The approach that we used to extract concepts has been presented in [31] and produces two sets of concepts. The first one is given by a pre-trained GoogLeNet [45] network, using its 1000-dimensional output as concept scores. ...
Conference Paper
Full-text available
Exoticism is the charm of the unfamiliar; it often connotes the unusual and the mysterious, and it can evoke the atmosphere of remote lands. Although it has received interest in different arts, like painting and music, no study has been conducted on understanding exoticism from a computational perspective. To the best of our knowledge, this work is the first to explore the problem of exoticism-aware image classification, aiming at automatically measuring the amount of exoticism in images and investigating the significant aspects of the task. The estimation of image exoticism could be applied in fields like advertising and travel suggestions, as well as to increase the serendipity and diversity of recommendations and search results. We propose a Fusion-based Deep Neural Network (FDNN) for this task, which combines image representations learned by Deep Neural Networks with visual and semantic hand-crafted features. Comparisons with other machine learning models show that our proposed architecture is the best performing one, reaching accuracies over 83% and 91% on two different datasets. Moreover, experiments with classifiers exploiting both visual and semantic features allow us to analyze which aspects are most important for identifying exotic content. Ground truth has been gathered by retrieving exotic and non-exotic images through a web search engine by posing queries with exotic and non-exotic semantics, and then assessing the exoticism of the retrieved images via a crowdsourcing evaluation. The dataset is publicly released to promote advances in this novel field.
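To make the fusion idea concrete, the following hedged sketch shows early fusion by feature concatenation followed by a simple classifier. The feature arrays and labels are random placeholders and the classifier is a stand-in; this is not the authors' FDNN architecture.

```python
# Illustrative sketch of fusion-based classification: concatenate deep features
# with hand-crafted features and train a simple classifier on the fused vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def fuse(deep_feats: np.ndarray, handcrafted_feats: np.ndarray) -> np.ndarray:
    """Early fusion by concatenation; both arrays are (n_samples, dim)."""
    return np.hstack([deep_feats, handcrafted_feats])

# Placeholder data: X_deep and X_hand stand for precomputed features,
# y for binary labels (1 = exotic, 0 = non-exotic).
X_deep = np.random.rand(200, 1024)   # e.g., CNN penultimate-layer activations
X_hand = np.random.rand(200, 64)     # e.g., colour/texture descriptors
y = np.random.randint(0, 2, size=200)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(fuse(X_deep, X_hand), y)
print("train accuracy:", clf.score(fuse(X_deep, X_hand), y))
```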
... This module indexes the video shots based on 1000 ImageNet concepts, 345 TRECVID SIN concepts, 500 event-related concepts, and 205 place-related concepts [5]. To obtain scores regarding the 1000 ImageNet concepts, we applied five pre-trained ImageNet deep convolutional neural networks (DCNNs) on the AVS test keyframes [5]. ...
... This module indexes the video shots based on 1000 ImageNet concepts, 345 TRECVID SIN concepts, 500 event-related concepts, and 205 place-related concepts [5]. To obtain scores regarding the 1000 ImageNet concepts, we applied five pre-trained ImageNet deep convolutional neural networks (DCNNs) on the AVS test keyframes [5]. The output of these networks was averaged in terms of arithmetic mean to obtain a single score for each of the 1000 concepts. ...
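A minimal sketch of that averaging step, assuming the per-network 1000-dimensional score vectors have already been computed for the keyframe:

```python
# Arithmetic mean of the ImageNet score vectors produced by several DCNNs
# for the same keyframe, yielding a single score per concept.
import numpy as np

def average_concept_scores(per_network_scores: list[np.ndarray]) -> np.ndarray:
    """per_network_scores: list of (1000,) arrays, one per pre-trained DCNN."""
    return np.mean(np.stack(per_network_scores, axis=0), axis=0)
```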
... Paper | Method | Results
Our | Weighted class-specific dictionary learning scheme | 98.93%
[14] | Multi-region two-stream R-CNN | 95.74%
[1] | Bag of Visual Words | 90.90%
[23] | Dense trajectories and motion boundary descriptors | 88.00%
[3] | CNN + rank pooling | 87.20%
[9] | hk-means and TF-IDF scoring based vocabulary construction | 78.40%
Our scheme outperforms the other listed methods in terms of average accuracy. We improve significantly, by approximately 3%, on the recognition performance for the UCF Sports dataset presented in Peng et al. [14]. ...
Chapter
Human action recognition has become a popular field for computer vision researchers over the past decade. This paper presents a human action recognition scheme based on a textual-information concept inspired by document retrieval systems. Videos are represented using a commonly used local feature representation. In addition, we formulate a new weighted class-specific dictionary learning scheme to reflect the importance of visual words for a particular action class. Weighted class-specific dictionary learning enables the scheme to learn a sparse representation for a particular action class. To evaluate our scheme on realistic and complex scenarios, we have tested it on the UCF Sports and UCF11 benchmark datasets. This paper reports experimental results that outperform recent state-of-the-art methods on the UCF Sports and UCF11 datasets, i.e., 98.93% and 93.88% in terms of average accuracy, respectively. To the best of our knowledge, this contribution is the first to apply a weighted class-specific dictionary learning method to realistic human action recognition datasets.
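As a rough illustration of the class-specific weighting idea, a TF-IDF-like simplification rather than the authors' exact dictionary learning scheme, one could re-weight Bag-of-Visual-Words histograms per action class as follows; the vocabulary size and smoothing constant are assumptions.

```python
# Hedged sketch: build BoVW histograms over local descriptors, then weight each
# visual word by how strongly it is associated with a target action class.
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(descriptor_sets, vocab_size=500, seed=0):
    """descriptor_sets: list of (n_i, d) local-descriptor arrays, one per video."""
    all_desc = np.vstack(descriptor_sets)
    km = KMeans(n_clusters=vocab_size, n_init=5, random_state=seed).fit(all_desc)
    hists = np.zeros((len(descriptor_sets), vocab_size))
    for i, desc in enumerate(descriptor_sets):
        words = km.predict(desc)
        hists[i] = np.bincount(words, minlength=vocab_size)
    return hists / np.maximum(hists.sum(axis=1, keepdims=True), 1)

def class_specific_weights(hists, labels, target_class):
    """Weight each visual word by how much more often it occurs in the target class."""
    labels = np.asarray(labels)
    in_cls = hists[labels == target_class].mean(axis=0)
    out_cls = hists[labels != target_class].mean(axis=0)
    return np.log((in_cls + 1e-6) / (out_cls + 1e-6))  # positive => word favours the class
```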
... Paper | Method | Results
Our | Weighted class-specific dictionary learning scheme | 98.93%
[14] | Multi-region two-stream R-CNN | 95.74%
[1] | Bag of Visual Words | 90.90%
[23] | Dense trajectories and motion boundary descriptors | 88.00%
[3] | CNN + rank pooling | 87.20%
[9] | hk-means and TF-IDF scoring based vocabulary construction | 78.40%
Our scheme outperforms the other listed methods in terms of average accuracy. We improve significantly, by approximately 3%, on the recognition performance for the UCF Sports dataset presented in Peng et al. [14]. ...
Conference Paper
Human action recognition has become a popular field for computer vision researchers over the past decade. This paper presents a human action recognition scheme based on a textual-information concept inspired by document retrieval systems. Videos are represented using a commonly used local feature representation. In addition, we formulate a new weighted class-specific dictionary learning scheme to reflect the importance of visual words for a particular action class. Weighted class-specific dictionary learning enables the scheme to learn a sparse representation for a particular action class. To evaluate our scheme on realistic and complex scenarios, we have tested it on the UCF Sports and UCF11 benchmark datasets. This paper reports experimental results that outperform recent state-of-the-art methods on the UCF Sports and UCF11 datasets, i.e., 98.93% and 93.88% in terms of average accuracy, respectively. To the best of our knowledge, this contribution is the first to apply a weighted class-specific dictionary learning method to realistic human action recognition datasets.
Chapter
Migration, and especially irregular migration, is a critical issue for border agencies and society in general. Migration-related situations and decisions are influenced by various factors, including the perceptions about migration routes and target countries. An improved understanding of such factors can be achieved by systematic automated analyses of media and social media channels, and the videos and images published in them. However, the multifaceted nature of migration and the variety of ways migration-related aspects are expressed in images and videos make the finding and automated analysis of migration-related multimedia content a challenging task. We propose a novel approach that effectively bridges the gap between a substantiated domain understanding - encapsulated into a set of Migration-related semantic concepts - and the expression of such concepts in a video, by introducing an advanced video analysis and retrieval method for this purpose.
Article
Classification of video events based on frame-level descriptors is a common approach to video recognition. Meanwhile, proper encoding of the frame-level descriptors is vital to the whole event classification procedure. While there are several efficient video descriptor encoding methods, the temporal ordering of the descriptors is often ignored in these encoding algorithms. In this paper, we show that by taking into account temporal inter-frame dependencies and tracking the chronological order of video sub-events, the accuracy of event recognition is further improved. First, the frame-level descriptors are extracted using convolutional neural networks (CNNs) pre-trained on ImageNet, which are fine-tuned on a portion of the training video frames. Then, a spatio-temporal encoding is applied to the derived descriptors. The proposed spatio-temporal encoding, the main contribution of this work, is inspired by the well-known vector of locally aggregated descriptors (VLAD) encoding in the spatial domain and by total variation denoising (TVD) in the temporal domain. The proposed unified spatio-temporal encoding is then shown to take the form of a convex optimization problem, which is solved efficiently with the alternating direction method of multipliers (ADMM) algorithm. The experimental results show the superiority of the proposed encoding method in terms of recognition accuracy over both frame-level video encoding approaches and spatio-temporal video representations. Compared to state-of-the-art approaches, our encoding method improves the mean average precision (mAP) on the Columbia Consumer Video (CCV), Unstructured Social Activity Attribute (USAA), YouTube-8M, and Kinetics datasets, and is very competitive on the FCVID dataset.
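For reference, a hedged sketch of the spatial part of such an encoding, plain VLAD over frame-level CNN descriptors; the paper's temporal total-variation term and ADMM solver are deliberately omitted here.

```python
# Minimal VLAD sketch over frame-level descriptors for one video (spatial part only).
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(frame_descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """frame_descriptors: (T, d) array of per-frame CNN descriptors for one video."""
    k, d = codebook.n_clusters, frame_descriptors.shape[1]
    assign = codebook.predict(frame_descriptors)
    vlad = np.zeros((k, d))
    for c in range(k):
        members = frame_descriptors[assign == c]
        if len(members):
            # accumulate residuals to the assigned cluster centre
            vlad[c] = (members - codebook.cluster_centers_[c]).sum(axis=0)
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))       # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)        # L2 normalization
```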
Chapter
Modern newsroom tools offer advanced functionality for automatic and semi-automatic content collection from the web and social media sources to accompany news stories. However, the content collected in this way often tends to be unstructured and may include irrelevant items. An important step in the verification process is to organize this content, both with respect to what it shows, and with respect to its origin. This chapter presents our efforts in this direction, which resulted in two components. One aims to detect semantic concepts in video shots, to help annotation and organization of content collections. We implement a system based on deep learning, featuring a number of advances and adaptations of existing algorithms to increase performance for the task. The other component aims to detect logos in videos in order to identify their provenance. We present our progress from a keypoint-based detection system to a system based on deep learning.
Conference Paper
Full-text available
In daily life, it is common for viewers to want to quickly browse scenes featuring their idols in TV series. In 2016, the TRECVID INS (Instance Search) task started to focus on identifying a specific target person in a target location. In this paper, we refer to this kind of task as P-S (Person-Scene) Instance Retrieval. Most existing approaches handle this task by separately obtaining person-instance and scene-instance retrieval results and directly combining them. However, we find that the person and scene instance retrieval modules are not always effective at the same time, which decreases accuracy if the results are aggregated directly. To solve this problem, we compute the results in two steps. (1) Early Elimination. Many noisy shots, such as those with an occluded person or scene, yield a high score in only one of the person/scene retrieval modules; the corresponding scores of these shots should be eliminated rather than propagated as noise. (2) Late Expansion. Considering a video's continuity, the person or scene in adjacent shots is likely to be the same, so we try to expand the results to the shots eliminated in the first step. On this basis, we propose an early elimination and late expansion method to improve the accuracy of P-S Instance Retrieval. Experimental results on the large-scale TRECVID INS dataset demonstrate the effectiveness of the proposed method.
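A simplified sketch of the elimination-then-expansion idea on per-shot score arrays is given below; the threshold, the product fusion, and the neighbour radius are illustrative assumptions, not the paper's actual choices.

```python
# Hedged sketch: suppress shots where either the person or the scene score is
# unreliable, fuse the remaining scores, then propagate evidence to neighbouring shots.
import numpy as np

def ps_retrieval_scores(person, scene, elim_thresh=0.1, neighbour_radius=1):
    """person, scene: (n_shots,) score arrays for one person/scene pair, in shot order."""
    person, scene = np.asarray(person, float), np.asarray(scene, float)
    keep = (person > elim_thresh) & (scene > elim_thresh)   # early elimination
    fused = np.where(keep, person * scene, 0.0)
    expanded = fused.copy()
    for i in np.flatnonzero(keep):                           # late expansion
        lo, hi = max(0, i - neighbour_radius), min(len(fused), i + neighbour_radius + 1)
        expanded[lo:hi] = np.maximum(expanded[lo:hi], fused[i] * 0.5)
    return expanded
```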
Article
Full-text available
In this work we deal with the problem of video concept detection, with the aim of using the concept detection results for more effective concept-based video retrieval. The key novelties of this work are: 1) The use of spatio-temporal video slices (tomographs) in the same way that visual keyframes are typically used in video concept detection schemes. These spatio-temporal slices capture in a compact way motion patterns that are useful for detecting semantic concepts and are used for training a number of base detectors. The latter augment the set of keyframe-based base detectors that can be trained using different frame representations. 2) The introduction of a generic methodology, built upon a genetic algorithm, for controlling which subset of the available base detectors (consequently, which subset of the possible shot representations) should be combined for developing an optimal detector for each specific concept. This methodology is directly applicable to the learning of hundreds of diverse concepts, while diverging from the “one size fits all” approach that is typically used in problems of this size. The proposed techniques are evaluated on the datasets of the 2011 and 2012 Semantic Indexing Task of TRECVID, each comprising several hundred hours of heterogeneous video clips and ground-truth annotations for tens of concepts that exhibit significant variation in terms of generality, complexity, and human participation. The experimental results demonstrate the merit of the proposed techniques.
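As an aside, a minimal sketch of extracting two such tomographs from a decoded shot; taking the centre row and centre column of every frame is one simple choice of slicing plane, not necessarily the exact configuration used in the paper.

```python
# Sketch: build a horizontal and a vertical spatio-temporal slice (tomograph)
# by stacking one pixel row/column of every frame of a shot over time.
import numpy as np

def tomographs(shot_frames: np.ndarray):
    """shot_frames: (T, H, W, 3) array of the shot's decoded frames."""
    t, h, w, _ = shot_frames.shape
    horizontal = shot_frames[:, h // 2, :, :]   # (T, W, 3): centre row over time
    vertical = shot_frames[:, :, w // 2, :]     # (T, H, 3): centre column over time
    return horizontal, vertical
```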
Conference Paper
Full-text available
This paper provides an overview of the tasks submitted to TRECVID 2012 by ITI-CERTH. ITI-CERTH participated in the Known-Item Search (KIS), the Semantic Indexing (SIN), the Event Detection in Internet Multimedia (MED), and the Multimedia Event Recounting (MER) tasks. In the SIN task, techniques are developed that combine video representations expressing motion semantics with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation. In the MED task, two methods are evaluated: one based on Gaussian mixture models (GMM) and audio features, and a "semantic model vector" approach that combines a pool of subclass kernel support vector machines (KSVMs) in an ECOC framework for event detection exploiting visual information only. Furthermore, we investigate fusion strategies for the two systems at an intermediate semantic level or at the score level (late fusion). In the MER task, a "model vector" approach is used to describe the semantic content of the videos, similar to the MED task, and a novel feature selection method is utilized to select the most discriminant concepts regarding the target event. Finally, the KIS search task is performed by employing VERGE, an interactive retrieval application combining retrieval functionalities in various modalities.
Conference Paper
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
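A toy sketch of one such ensemble construction, bagging with majority voting over decision trees, is shown below; the base learner and the assumption of non-negative integer class labels are illustrative choices, not part of the cited review.

```python
# Hedged sketch of bagging: train several classifiers on bootstrap samples and
# classify new points by majority vote over their predictions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_members=25, seed=0):
    """Assumes class labels are non-negative integers."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
        clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    votes = np.stack(votes)                                      # (n_members, n_test)
    # majority vote over ensemble members, one column per test point
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```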
Article
Multi-label learning deals with the problem where each example is represented by a single instance while being associated with multiple class labels. A number of multi-label learning approaches have been proposed recently, among which multi-label lazy learning methods have been shown to yield good generalization abilities. Existing multi-label learning algorithms based on lazy learning techniques do not address the correlations between the different labels of each example, so their performance can be negatively affected. In this paper, an improved multi-label lazy learning approach named IMLLA is proposed. Given a test example, IMLLA works by first identifying its neighboring instances in the training set for each possible class. After that, a label counting vector is generated from those neighboring instances and fed to the trained linear classifiers. In this way, information embedded in other classes is involved in the process of predicting the label of each class, so that the inter-label relationships of each example are appropriately addressed. Experiments are conducted on several synthetic data sets and two benchmark real-world data sets concerning natural scene classification and yeast gene functional analysis. Experimental results show that the performance of IMLLA is superior to other well-established multi-label learning algorithms, including a state-of-the-art lazy-style multi-label learner.
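A hedged, simplified sketch of this lazy multi-label idea (not IMLLA itself): build a label-counting vector from an example's nearest neighbours and feed it to one linear classifier per label. It assumes a binary label matrix and that every label has both positive and negative training examples.

```python
# Simplified lazy multi-label sketch: neighbour label counts as meta-features,
# one linear classifier per label trained on those counts.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

def label_count_vectors(X, Y, X_query, k=10):
    """Y: (n, n_labels) binary matrix; returns (m, n_labels) neighbour label counts."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X_query)
    return Y[idx].sum(axis=1)        # counts of each label among the k neighbours

def fit_predict(X_train, Y_train, X_test, k=10):
    Z_train = label_count_vectors(X_train, Y_train, X_train, k=k)
    Z_test = label_count_vectors(X_train, Y_train, X_test, k=k)
    preds = np.zeros((len(X_test), Y_train.shape[1]), dtype=int)
    for j in range(Y_train.shape[1]):
        clf = LogisticRegression(max_iter=1000).fit(Z_train, Y_train[:, j])
        preds[:, j] = clf.predict(Z_test)
    return preds
```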
Conference Paper
In this paper, a new feature selection method is used, in combination with a semantic model vector video representation, in order to enumerate the key semantic evidences of an event in a video signal. In particular, a set of semantic concept detectors is first used to estimate a model vector for each video signal, where each element of the model vector denotes the degree of confidence that the respective concept is depicted in the video. Then, a novel feature selection method is learned for each event of interest. This method is based on exploiting the first two eigenvectors derived using the eigenvalue formulation of mixture subclass discriminant analysis. Subsequently, given a video-event pair, the proposed method jointly evaluates the significance of each concept for the detection of the given event and the degree of confidence with which this concept is detected in the given video, in order to decide which concepts provide the strongest evidence in support of the provided video-event link. Experimental results using a video collection of TRECVID demonstrate the effectiveness of the proposed video event recounting method.
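For intuition, the following simplified sketch ranks concepts by the product of a learned per-concept discriminant weight and the concept's confidence in the video; plain LDA stands in for the paper's mixture subclass discriminant analysis, so this is an approximation of the idea rather than the proposed method.

```python
# Hedged sketch of concept recounting: significance learned on model vectors
# (here via plain LDA) combined with per-video detection confidences.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def concept_significance(model_vectors, event_labels):
    """model_vectors: (n_videos, n_concepts); event_labels: 1 for the target event, else 0."""
    lda = LinearDiscriminantAnalysis().fit(model_vectors, event_labels)
    return np.abs(lda.coef_[0])                      # per-concept significance for the event

def recount(video_model_vector, significance, concept_names, top_k=5):
    evidence = significance * video_model_vector      # joint score: significance x confidence
    top = np.argsort(evidence)[::-1][:top_k]
    return [(concept_names[i], float(evidence[i])) for i in top]
```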
Conference Paper
Exploiting concept correlations is a promising way for boosting the performance of concept detection systems, aiming at concept based video indexing or annotation. Stacking approaches, which can model the correlation information, appear to be the most commonly used techniques to this end. This paper performs a comparative study and proposes an improved way of employing stacked models, by using multi-label classification methods in the last level of the stack. The experimental results on the TRECVID 2011 and 2012 semantic indexing task datasets show the effectiveness of the proposed framework compared to existing works. In addition to this, as part of our comparative study, we investigate whether the evaluation of concept detection results at the level of individual concepts, as is typically the case in the literature, is appropriate for assessing the usefulness of concept detection results in both video indexing applications and in the somewhat different problem of video annotation.
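A hedged sketch of such a two-level stack: the first-level detector scores serve as meta-features, and a second-level multi-label model that can capture label correlations refines them. The choice of a scikit-learn ClassifierChain with logistic regression is an assumption for illustration, not the paper's configuration.

```python
# Sketch of stacking for concept detection: first-level per-concept scores in,
# refined multi-label scores out of a correlation-aware second-level model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

def refine_with_stacking(scores_train, Y_train, scores_test):
    """scores_*: (n_samples, n_concepts) first-level detector outputs; Y_train: binary labels."""
    chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
    chain.fit(scores_train, Y_train)
    return chain.predict_proba(scores_test)   # refined per-concept scores
```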
Article
In this paper, a theoretical link between mixture subclass discriminant analysis (MSDA) and a restricted Gaussian model is first presented. Then, two further discriminant analysis (DA) methods, i.e., fractional step MSDA (FSMSDA) and kernel MSDA (KMSDA), are proposed. Linking MSDA to an appropriate Gaussian model allows the derivation of a new DA method under the expectation maximization (EM) framework (EM-MSDA), which simultaneously derives the discriminant subspace and the maximum likelihood estimates. The two other proposed methods generalize MSDA in order to solve problems inherited from conventional DA. FSMSDA solves the subclass separation problem, that is, the situation in which the dimensionality of the discriminant subspace is strictly smaller than the rank of the between-subclass scatter matrix. This is done by an appropriate weighting scheme and the utilization of an iterative algorithm for preserving useful discriminant directions. On the other hand, KMSDA uses the kernel trick to separate data with a nonlinearly separable subclass structure. Extensive experimentation shows that the proposed methods outperform conventional MSDA and other linear discriminant analysis variants.
Article
Given the exponential growth of videos published on the Internet, mechanisms for clustering, searching, and browsing large numbers of videos have become a major research area. More importantly, there is demand for event detectors that go beyond simply finding objects and instead detect more abstract concepts, such as "feeding an animal" or a "wedding ceremony". This article presents an approach for event classification that enables searching for arbitrary events, including more abstract concepts, in found video collections based on analysis of the audio track. The approach does not rely on speech processing and is language-independent; instead, it generates models for a set of example query videos using a mixture of two types of audio features: Linear-Frequency Cepstral Coefficients and Modulation Spectrogram Features. This approach can be used to complement video analysis and requires no domain-specific tagging. Application of the approach to the TRECVid MED 2011 development set, which consists of more than 4000 random "wild" videos from the Internet, has shown a detection accuracy of 64%, including videos that do not contain an audio track.
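In this spirit, a hedged sketch of audio-only event scoring: linear-frequency cepstral coefficients computed from an STFT (a rough stand-in for the paper's exact features) and one Gaussian mixture model per event, scored by the average frame log-likelihood. The paths, sampling rate, and model sizes are assumptions.

```python
# Illustrative audio-only event model: LFCC-like features per frame,
# one GMM per event, scoring by mean log-likelihood over the frames of a video.
import numpy as np
import librosa
from scipy.fft import dct
from sklearn.mixture import GaussianMixture

def lfcc(audio_path, n_coeff=20):
    """Linear-frequency cepstral coefficients: log power spectrum followed by a DCT."""
    y, sr = librosa.load(audio_path, sr=16000)
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=256)) ** 2
    log_spec = np.log(spec + 1e-10)
    return dct(log_spec, axis=0, norm="ortho")[:n_coeff].T   # (frames, n_coeff)

def train_event_model(audio_paths, n_components=32):
    frames = np.vstack([lfcc(p) for p in audio_paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(frames)

def event_score(gmm, audio_path):
    return float(gmm.score_samples(lfcc(audio_path)).mean())  # higher = more likely the event
```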