Conference Paper · PDF available

Abstract

This paper provides an overview of the tasks submitted to TRECVID 2013 by ITI-CERTH. ITI-CERTH participated in the Semantic Indexing (SIN), the Event Detection in Internet Multimedia (MED), the Multimedia Event Recounting (MER) and the Instance Search (INS) tasks. In the SIN task, techniques are developed which combine new video representations (video tomographs) with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation, ensemble construction techniques, and a multi-label learning method for score refinement. In the MED task, an efficient method that uses only static visual features as well as limited audio information is evaluated. In the MER sub-task of MED, a discriminant analysis-based feature selection method is combined with a model vector approach for selecting the key semantic entities depicted in the video that best describe the detected event. Finally, the INS task is performed by employing VERGE, an interactive retrieval application combining retrieval functionalities in various modalities, used previously for supporting the Known Item Search (KIS) task.
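To make the "model vector" idea used in the MED and MER runs concrete, here is a minimal sketch: a bank of per-concept binary detectors scores each keyframe, and the scores are pooled into one confidence per concept. The detector type (linear SVMs), the toy data, and mean pooling are illustrative assumptions, not the submitted configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))           # stand-in keyframe descriptors
detectors = []
for c in range(10):                            # one binary detector per concept
    y = rng.integers(0, 2, size=200)           # toy concept labels
    detectors.append(LinearSVC(dual=False).fit(X_train, y))

def model_vector(keyframes):
    """Pool per-keyframe detector scores into one confidence per concept."""
    scores = np.stack([d.decision_function(keyframes) for d in detectors], axis=1)
    return scores.mean(axis=0)                 # (n_concepts,)

video = rng.normal(size=(30, 64))              # 30 keyframes of one video
print(model_vector(video).shape)               # (10,) -> one score per concept
```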
... We experiment with the inclusion of two Convolutional Neural Networks in our FDNN, namely AlexNet [27] and GoogLeNet [49], whose weights are learned from the large-scale ImageNet dataset. We chose AlexNet and GoogLeNet because they were the best performing models in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [42] in 2012 and 2014, respectively, and they have been widely used for image classification in recent years (e.g., in [6,12,33,63]). AlexNet consists of 5 convolutional layers (one with 11 × 11, one with 5 × 5, three with 3 × 3 filter sizes), max-pooling layers, and 3 final fully connected layers. ...
... The approach that we used to extract concepts has been presented in [33] and produces two sets of concepts. The first one is given by a pre-trained GoogLeNet [49] network, using its 1000-dimensional output as concept scores. ...
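The excerpt above describes using a pre-trained GoogLeNet's 1000-dimensional output directly as concept scores. A minimal PyTorch sketch of that step, with torchvision's googlenet assumed as a stand-in for the exact network the cited work used:

```python
import torch
from torchvision import models
from PIL import Image

# Pre-trained GoogLeNet; its 1000 ImageNet classes serve as the concept pool.
weights = models.GoogLeNet_Weights.IMAGENET1K_V1
net = models.googlenet(weights=weights).eval()
preprocess = weights.transforms()

def concept_scores(image: Image.Image) -> torch.Tensor:
    x = preprocess(image).unsqueeze(0)         # (1, 3, 224, 224)
    with torch.no_grad():
        logits = net(x)                        # (1, 1000)
    return logits.softmax(dim=1).squeeze(0)    # 1000 concept confidences

print(concept_scores(Image.new("RGB", (256, 256))).shape)  # torch.Size([1000])
```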
... International Journal of Multimedia Information Retrieval (2019) 8:19–33 ...
Article
Full-text available
Exoticism is the charm of the unfamiliar or of something remote. It has received significant interest in different kinds of arts, but although visual concept classification in images and videos for semantic multimedia retrieval has been researched for years, the visual concept of exoticism has not yet been investigated from a computational perspective. In this paper, we present the first approach to automatically classify images as exotic or non-exotic. We have gathered two large datasets that cover exoticism in a general as well as a concept-specific way. The datasets have been annotated in a crowdsourcing approach. To circumvent cultural differences in the annotation, only North American crowdworkers were employed for this task. Two deep learning architectures to learn the concept of exoticism are evaluated. Besides deep learning features, we also investigate the usefulness of hand-crafted features, which are combined with deep features in our proposed fusion-based approach. Different machine learning models are compared with the fusion-based approach, which is the best performing one, reaching accuracies of over 83% and 91% on two different datasets. Comprehensive experimental results provide insights into which features contribute most to recognizing exoticism. The estimation of image exoticism could be applied in fields like advertising and travel suggestions, as well as to increase the serendipity and diversity of recommendations and search results.
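The fusion idea described above can be illustrated with a minimal sketch: deep features and hand-crafted features are concatenated and fed to a single classifier. The feature dimensions and the logistic-regression head are assumptions for illustration; the paper's FDNN fuses the representations inside the network itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
deep = rng.normal(size=(500, 1024))        # e.g. CNN penultimate-layer features
crafted = rng.normal(size=(500, 50))       # e.g. color/texture statistics
y = rng.integers(0, 2, size=500)           # 1 = exotic, 0 = non-exotic (toy labels)

X = np.hstack([deep, crafted])             # early fusion by concatenation
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```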
... Typical concept annotation and retrieval methods use image/video exemplars as training materials to develop pre-defined concept detectors [20,21,22]. However, these methods suffer from scalability limitations because it is difficult to collect and annotate large enough datasets. ...
... For the SIN'13 dataset, we compared with the work presented in [21] and the CERTH participation in the TRECVID SIN task in 2013 [22]. For the SIN'15 dataset, we offer a comparison with the CERTH participation in the TRECVID SIN task in 2015 [20]. Table 3 shows that our approach is very competitive, even though the baselines are explicitly designed for the SIN task. ...
Preprint
Full-text available
Migration, and especially irregular migration, is a critical issue for border agencies and society in general. Migration-related situations and decisions are influenced by various factors, including the perceptions about migration routes and target countries. An improved understanding of such factors can be achieved by systematic automated analyses of media and social media channels, and the videos and images published in them. However, the multifaceted nature of migration and the variety of ways migration-related aspects are expressed in images and videos make the finding and automated analysis of migration-related multimedia content a challenging task. We propose a novel approach that effectively bridges the gap between a substantiated domain understanding - encapsulated into a set of Migration-related semantic concepts - and the expression of such concepts in a video, by introducing an advanced video analysis and retrieval method for this purpose.
... We experiment with the inclusion of two Convolutional Neural Networks in our FDNN, namely AlexNet [26] and GoogLeNet [45], whose weights are learned from the large-scale ImageNet dataset. We chose AlexNet and GoogLeNet because they were the best performing models in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [38] in 2012 and 2014, respectively, and they have been widely used for image classification in recent years (e.g. in [8,11,31,56]). AlexNet consists of 5 convolutional layers (one with 11 × 11, one with 5 × 5, three with 3 × 3 filter sizes), max-pooling layers, and 3 final fully connected layers. ...
... The approach that we used to extract concepts has been presented in [31] and produces two sets of concepts. The first one is given by a pre-trained GoogLeNet [45] network, using its 1000-dimensional output as concept scores. ...
Conference Paper
Full-text available
Exoticism is the charm of the unfamiliar; it often connotes the unusual and the mysterious, and it can evoke the atmosphere of remote lands. Although it has received interest in different arts, like painting and music, no study has been conducted on understanding exoticism from a computational perspective. To the best of our knowledge, this work is the first to explore the problem of exoticism-aware image classification, aiming at automatically measuring the amount of exoticism in images and investigating the significant aspects of the task. The estimation of image exoticism could be applied in fields like advertising and travel suggestions, as well as to increase the serendipity and diversity of recommendations and search results. We propose a Fusion-based Deep Neural Network (FDNN) for this task, which combines image representations learned by Deep Neural Networks with visual and semantic hand-crafted features. Comparisons with other machine learning models show that our proposed architecture is the best performing one, reaching accuracies of over 83% and 91% on two different datasets. Moreover, experiments with classifiers exploiting both visual and semantic features allow us to analyze which aspects are most important for identifying exotic content. Ground truth has been gathered by retrieving exotic and non-exotic images through a web search engine, posing queries with exotic and non-exotic semantics, and then assessing the exoticism of the retrieved images via a crowdsourcing evaluation. The dataset is publicly released to promote advances in this novel field.
Article
Full-text available
People with Intellectual Disability (ID) encounter several problems in their daily living regarding their needs, activities, interrelationships, and communication. In this paper, an interactive platform is proposed, aiming to provide personalized recommendations for information and entertainment, including creative and educational activities, tailored to the special user needs of this population. Furthermore, the proposed platform integrates capabilities for the automatic recognition of health-related emergencies, such as fever, oxygen saturation decline, and tachycardia, as well as location tracking and detection of wandering behavior based on smartwatch/smartphone sensors, while providing appropriate notifications to caregivers and automated assistance to people with ID through voice instructions and interaction with a virtual assistant. A small-scale pilot study has been carried out, in which a group of end-users participated in testing the integrated platform, verifying its effectiveness with respect to the recommended services. The experimental results indicate the potential value of the proposed system in providing routine health measurements, identifying and managing emergency cases, and supporting a creative and qualitative daily life for people with disabilities.
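A toy, rule-based reading of the emergency recognition described above (fever, oxygen-saturation decline, tachycardia). The thresholds are common clinical rules of thumb assumed purely for illustration, not values taken from the paper:

```python
def detect_emergencies(temp_c: float, spo2_pct: float, heart_rate_bpm: float):
    """Return the list of emergency conditions indicated by the readings."""
    alerts = []
    if temp_c >= 38.0:                 # assumed fever threshold
        alerts.append("fever")
    if spo2_pct < 92.0:                # assumed SpO2 alert level
        alerts.append("oxygen saturation decline")
    if heart_rate_bpm > 120.0:         # assumed resting tachycardia level
        alerts.append("tachycardia")
    return alerts

print(detect_emergencies(38.4, 95.0, 130.0))  # ['fever', 'tachycardia']
```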
Chapter
Text-video retrieval has attracted growing attention recently. A dominant approach is to learn a common space for aligning the two modalities. However, videos generally deliver richer content than text, and captions usually miss certain events or details in the video. This information imbalance between the two modalities makes it difficult to align their representations. In this paper, we propose a general framework, LINguistic ASsociation (LINAS), which utilizes the complementarity between captions corresponding to the same video. Concretely, we first train a teacher model taking extra relevant captions as inputs, which can aggregate language semantics to obtain more comprehensive text representations. Since the additional captions are inaccessible during inference, Knowledge Distillation is employed to train a student model with a single caption as input. We further propose an Adaptive Distillation strategy, which allows the student model to adaptively learn the knowledge from the teacher model. This strategy also suppresses the spurious relations introduced during the linguistic association. Extensive experiments demonstrate the effectiveness and efficiency of LINAS with various baseline architectures on benchmark datasets. Our code is available at https://github.com/silenceFS/LINAS.
Keywords: Text-video retrieval, Knowledge distillation
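A hedged PyTorch sketch of the distillation step described above: a student encoder seeing a single caption is trained to match a teacher that saw extra associated captions, with a per-sample weight standing in for the adaptive distillation idea. The encoders and the weighting scheme are simplified placeholders, not the LINAS architecture itself.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, adaptive_w):
    """student_emb, teacher_emb: (batch, d) caption embeddings;
    adaptive_w: (batch,) weights letting the student down-weight
    unreliable teacher targets (the 'adaptive' part, simplified)."""
    per_sample = 1 - F.cosine_similarity(student_emb, teacher_emb, dim=1)
    return (adaptive_w * per_sample).mean()

s = torch.randn(8, 256, requires_grad=True)   # student outputs (toy)
t = torch.randn(8, 256)                       # fixed teacher targets
w = torch.sigmoid(torch.randn(8))             # placeholder adaptive weights
loss = distillation_loss(s, t, w)
loss.backward()                               # gradients flow to the student only
print(float(loss))
```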
Chapter
Migration, and especially irregular migration, is a critical issue for border agencies and society in general. Migration-related situations and decisions are influenced by various factors, including the perceptions about migration routes and target countries. An improved understanding of such factors can be achieved by systematic automated analyses of media and social media channels, and the videos and images published in them. However, the multifaceted nature of migration and the variety of ways migration-related aspects are expressed in images and videos make the finding and automated analysis of migration-related multimedia content a challenging task. We propose a novel approach that effectively bridges the gap between a substantiated domain understanding - encapsulated into a set of Migration-related semantic concepts - and the expression of such concepts in a video, by introducing an advanced video analysis and retrieval method for this purpose.
Article
Classification of video events based on frame-level descriptors is a common approach to video recognition. Meanwhile, proper encoding of the frame-level descriptors is vital to the whole event classification procedure. While there are some quite efficient video descriptor encoding methods, the temporal ordering of the descriptors is often ignored in these encoding algorithms. In this paper, we show that by taking into account temporal inter-frame dependencies and tracking the chronological order of video sub-events, the accuracy of event recognition is further improved. First, the frame-level descriptors are extracted using convolutional neural networks (CNNs) pre-trained on ImageNet, which are fine-tuned on a portion of the training video frames. Then, a spatio-temporal encoding is applied to the derived descriptors. The proposed spatio-temporal encoding, the main contribution of this work, is inspired by the well-known vector of locally aggregated descriptors (VLAD) encoding in the spatial domain and by total variation de-noising (TVD) in the temporal domain. The proposed unified spatio-temporal encoding is then shown to take the form of a convex optimization problem, which is solved efficiently with the alternating direction method of multipliers (ADMM) algorithm. The experimental results show the superiority of the proposed encoding method in terms of recognition accuracy over both frame-level video encoding approaches and spatio-temporal video representations. Compared to state-of-the-art approaches, our encoding method improves the mean average precision (mAP) on the Columbia Consumer Video (CCV), Unstructured Social Activity Attribute (USAA), YouTube-8M, and Kinetics datasets, and is very competitive on the FCVID dataset.
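A rough numpy sketch of the two ingredients named above: VLAD encoding of frame descriptors in the spatial domain, and total-variation smoothing of the descriptor sequence in the temporal domain. The paper solves a joint convex problem with ADMM; here the TV step is approximated with a few gradient steps on a smoothed TV objective, purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(frames, kmeans):
    """Aggregate residuals of frames to their nearest cluster centers."""
    labels = kmeans.predict(frames)
    v = np.zeros_like(kmeans.cluster_centers_)
    for k in range(kmeans.n_clusters):
        if np.any(labels == k):
            v[k] = (frames[labels == k] - kmeans.cluster_centers_[k]).sum(axis=0)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def tv_smooth(frames, lam=0.5, lr=0.1, iters=200, eps=1e-6):
    """Gradient descent on 0.5||x - frames||^2 + lam * smoothed-TV(x)."""
    x = frames.copy()
    for _ in range(iters):
        d = np.diff(x, axis=0)
        grad_tv = d / np.sqrt(d ** 2 + eps)   # gradient of smoothed |x_{t+1}-x_t|
        g = np.zeros_like(x)
        g[:-1] -= grad_tv
        g[1:] += grad_tv
        x -= lr * ((x - frames) + lam * g)
    return x

rng = np.random.default_rng(2)
frames = rng.normal(size=(40, 16))            # 40 frame-level descriptors (toy)
km = KMeans(n_clusters=4, n_init=10).fit(frames)
print(vlad(tv_smooth(frames), km).shape)      # (64,) video-level code
```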
Chapter
Modern newsroom tools offer advanced functionality for automatic and semi-automatic content collection from the web and social media sources to accompany news stories. However, the content collected in this way often tends to be unstructured and may include irrelevant items. An important step in the verification process is to organize this content, both with respect to what it shows, and with respect to its origin. This chapter presents our efforts in this direction, which resulted in two components. One aims to detect semantic concepts in video shots, to help annotation and organization of content collections. We implement a system based on deep learning, featuring a number of advances and adaptations of existing algorithms to increase performance for the task. The other component aims to detect logos in videos in order to identify their provenance. We present our progress from a keypoint-based detection system to a system based on deep learning.
Article
Full-text available
In this work we deal with the problem of video concept detection, for the purpose of using the concept detection results towards more effective concept-based video retrieval. The key novelties of this work are: 1) The use of spatio-temporal video slices (tomographs) in the same way that visual keyframes are typically used in video concept detection schemes. These spatio-temporal slices capture in a compact way motion patterns that are useful for detecting semantic concepts, and they are used for training a number of base detectors. The latter augment the set of keyframe-based base detectors that can be trained using different frame representations. 2) The introduction of a generic methodology, built upon a genetic algorithm, for controlling which subset of the available base detectors (consequently, which subset of the possible shot representations) should be combined for developing an optimal detector for each specific concept. This methodology is directly applicable to the learning of hundreds of diverse concepts, while diverging from the "one size fits all" approach that is typically used in problems of this size. The proposed techniques are evaluated on the datasets of the 2011 and 2012 Semantic Indexing Task of TRECVID, each comprising several hundred hours of heterogeneous video clips and ground-truth annotations for tens of concepts that exhibit significant variation in terms of generality, complexity, and human participation. The experimental results demonstrate the merit of the proposed techniques.
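A compact sketch of the genetic-algorithm idea in novelty 2) above: each individual is a bit mask over the available base detectors, and fitness is the validation performance of the fused subset. The fitness measure (a simple average precision), the fusion rule (mean of scores), and the GA parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_det, n_val = 12, 300
det_scores = rng.random((n_det, n_val))        # base-detector scores on validation shots
y_val = rng.integers(0, 2, size=n_val)         # toy ground truth for one concept

def fitness(mask):
    if not mask.any():
        return 0.0
    fused = det_scores[mask].mean(axis=0)      # late fusion of the selected detectors
    order = np.argsort(-fused)                 # rank shots; score with simple AP
    hits = np.cumsum(y_val[order])
    prec = hits / (np.arange(n_val) + 1)
    return float((prec * y_val[order]).sum() / max(y_val.sum(), 1))

pop = rng.integers(0, 2, size=(20, n_det)).astype(bool)
for gen in range(30):
    f = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(-f)[:10]]         # selection: keep the fitter half
    kids = parents[rng.integers(0, 10, 20)].copy()
    mates = parents[rng.integers(0, 10, 20)]
    cut = rng.integers(1, n_det, 20)
    for i in range(20):                        # one-point crossover
        kids[i, cut[i]:] = mates[i, cut[i]:]
    kids ^= rng.random(kids.shape) < 0.05      # bit-flip mutation
    pop = kids
best = pop[np.argmax([fitness(m) for m in pop])]
print("selected detectors:", np.flatnonzero(best))
```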
Conference Paper
Full-text available
This paper provides an overview of the tasks submitted to TRECVID 2012 by ITI-CERTH. ITI-CERTH participated in the Known-Item Search (KIS) and Semantic Indexing (SIN) tasks, as well as in the Event Detection in Internet Multimedia (MED) and Multimedia Event Recounting (MER) tasks. In the SIN task, techniques are developed which combine video representations that express motion semantics with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation. In the MED task, two methods are evaluated: one that is based on Gaussian mixture models (GMM) and audio features, and a "semantic model vector" approach that combines a pool of subclass kernel support vector machines (KSVMs) in an ECOC framework for event detection exploiting visual information only. Furthermore, we investigate strategies for fusing the two systems at an intermediate semantic level or at the score level (late fusion). In the MER task, a "model vector" approach is used to describe the semantic content of the videos, similar to the MED task, and a novel feature selection method is utilized to select the most discriminant concepts regarding the target event. Finally, the KIS search task is performed by employing VERGE, which is an interactive retrieval application combining retrieval functionalities in various modalities.
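A minimal sketch of the score-level (late) fusion mentioned above: event scores from an audio-based system and a visual model-vector system are combined by a weighted sum. The weight and the min-max normalisation are illustrative assumptions, not the submitted configuration.

```python
import numpy as np

def late_fusion(audio_scores, visual_scores, w=0.4):
    """Weighted sum of per-system scores after per-system normalisation."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return w * norm(audio_scores) + (1 - w) * norm(visual_scores)

print(late_fusion([0.2, 1.5, 0.9], [10.0, 3.0, 7.5]))
```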
Conference Paper
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.
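A small sketch of the (weighted) voting idea the review above is about. Three standard sklearn classifiers stand in for arbitrary ensemble members; the weights are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=0)
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft", weights=[2, 3, 1],   # weighted vote over predicted probabilities
).fit(X, y)
print(vote.score(X, y))
```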
Article
Multi-label learning deals with the problem where each example is represented by a single instance while being associated with multiple class labels. A number of multi-label learning approaches have been proposed recently, among which multi-label lazy learning methods have been shown to yield good generalization abilities. Existing multi-label learning algorithms based on lazy learning techniques do not address the correlations between the different labels of each example, so their performance can be negatively influenced. In this paper, an improved multi-label lazy learning approach named IMLLA is proposed. Given a test example, IMLLA works by first identifying its neighboring instances in the training set for each possible class. After that, a label counting vector is generated from those neighboring instances and fed to the trained linear classifiers. In this way, information embedded in other classes is involved in the process of predicting the label of each class, so that the inter-label relationships of each example are appropriately addressed. Experiments are conducted on several synthetic data sets and two benchmark real-world data sets regarding natural scene classification and yeast gene functional analysis. Experimental results show that the performance of IMLLA is superior to other well-established multi-label learning algorithms, including one of the state-of-the-art lazy-style multi-label learners.
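A condensed sketch of the prediction scheme described above: for a test example, neighbours are found in the training set, a label counting vector is built from their label sets, and per-label linear classifiers score it. Global (rather than per-class) neighbour search and the logistic-regression heads are simplifications; treat this as an illustration of the data flow, not a faithful IMLLA implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))
Y = rng.integers(0, 2, size=(300, 5))          # 5 possible labels per example

nn = NearestNeighbors(n_neighbors=10).fit(X)
_, idx = nn.kneighbors(X)
counts = Y[idx].sum(axis=1)                    # label counting vectors (300, 5)
clfs = [LogisticRegression().fit(counts, Y[:, l]) for l in range(5)]

x_test = rng.normal(size=(1, 20))
_, tidx = nn.kneighbors(x_test)
c = Y[tidx[0]].sum(axis=0, keepdims=True)      # counting vector of the test example
print([clf.predict_proba(c)[0, 1] for clf in clfs])  # one confidence per label
```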
Conference Paper
In this paper, a new feature selection method is used, in combination with a semantic model vector video representation, in order to enumerate the key semantic evidence of an event in a video signal. In particular, a set of semantic concept detectors is first used for estimating a model vector for each video signal, where each element of the model vector denotes the degree of confidence that the respective concept is depicted in the video. Then, a novel feature selection method is learned for each event of interest. This method is based on exploiting the first two eigenvectors derived using the eigenvalue formulation of mixture subclass discriminant analysis. Subsequently, given a video-event pair, the proposed method jointly evaluates the significance of each concept for the detection of the given event and the degree of confidence with which this concept is detected in the given video, in order to decide which concepts provide the strongest evidence in support of the provided video-event link. Experimental results using a video collection of TRECVID demonstrate the effectiveness of the proposed video event recounting method.
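A loose sketch of the recounting step described above: a discriminant direction over model-vector dimensions gives per-concept importance, and each concept's importance is combined with its detection confidence in the given video to pick the strongest evidence. Plain LDA stands in here for the mixture subclass discriminant analysis the paper actually uses, and the toy data is random.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
MV = rng.random((200, 30))                 # model vectors: 30 concept scores each
y = rng.integers(0, 2, size=200)           # 1 = event present in the video

lda = LinearDiscriminantAnalysis(solver="eigen").fit(MV, y)
importance = np.abs(lda.scalings_[:, 0])   # leading eigenvector ~ concept relevance

video_mv = rng.random(30)                  # model vector of one test video
evidence = importance * video_mv           # relevance x detection confidence
print("top evidence concepts:", np.argsort(-evidence)[:5])
```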
Conference Paper
Exploiting concept correlations is a promising way of boosting the performance of concept detection systems aimed at concept-based video indexing or annotation. Stacking approaches, which can model this correlation information, appear to be the most commonly used techniques to this end. This paper performs a comparative study and proposes an improved way of employing stacked models, by using multi-label classification methods in the last level of the stack. Experimental results on the TRECVID 2011 and 2012 semantic indexing task datasets show the effectiveness of the proposed framework compared to existing works. In addition, as part of our comparative study, we investigate whether the evaluation of concept detection results at the level of individual concepts, as is typically the case in the literature, is appropriate for assessing the usefulness of concept detection results both in video indexing applications and in the somewhat different problem of video annotation.
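A bare-bones sketch of the stacking architecture discussed above: first-level concept detectors produce a score vector per shot, and a second-level multi-label classifier refines all concepts jointly from that vector. The choice of base and meta learners is an assumption for illustration (the proper protocol would also use cross-validated first-level scores to avoid overfitting the meta level).

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 64))                 # shot features (toy)
Y = rng.integers(0, 2, size=(400, 8))          # 8 concepts per shot

base = [LinearSVC(dual=False).fit(X, Y[:, c]) for c in range(8)]
S = np.stack([b.decision_function(X) for b in base], axis=1)  # first-level scores

meta = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(S, Y)
x_new = rng.normal(size=(1, 64))
s_new = np.stack([b.decision_function(x_new) for b in base], axis=1)
print(meta.predict(s_new))                     # refined multi-label decision
```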
Article
In this paper, a theoretical link between mixture subclass discriminant analysis (MSDA) and a restricted Gaussian model is first presented. Then, two further discriminant analysis (DA) methods, i.e., fractional step MSDA (FSMSDA) and kernel MSDA (KMSDA), are proposed. Linking MSDA to an appropriate Gaussian model allows the derivation of a new DA method under the expectation-maximization (EM) framework (EM-MSDA), which simultaneously derives the discriminant subspace and the maximum likelihood estimates. The two other proposed methods generalize MSDA in order to solve problems inherited from conventional DA. FSMSDA solves the subclass separation problem, that is, the situation in which the dimensionality of the discriminant subspace is strictly smaller than the rank of the between-subclass scatter matrix. This is done by an appropriate weighting scheme and the utilization of an iterative algorithm for preserving useful discriminant directions. On the other hand, KMSDA uses the kernel trick to separate data with a nonlinearly separable subclass structure. Extensive experimentation shows that the proposed methods outperform conventional MSDA and other linear discriminant analysis variants.
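A hedged numpy sketch of the core computation behind subclass DA methods like those above: subclass means define a between-subclass scatter, discriminant directions come from a generalized eigenproblem against the within-subclass scatter. K-means stands in for the subclass assignment, and the scatter definitions use a common simplified form (subclass means against the global mean), not the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
y = np.repeat([0, 1], 100)

Sb = np.zeros((5, 5))                          # between-subclass scatter
Sw = np.zeros((5, 5))                          # within-subclass scatter
mu = X.mean(axis=0)
for c in (0, 1):
    Xc = X[y == c]
    sub = KMeans(n_clusters=2, n_init=10).fit_predict(Xc)  # 2 subclasses per class
    for s in (0, 1):
        Xs = Xc[sub == s]
        ms = Xs.mean(axis=0)
        Sb += len(Xs) * np.outer(ms - mu, ms - mu)
        Sw += (Xs - ms).T @ (Xs - ms)

evals, evecs = eigh(Sb, Sw)                    # generalized eigenproblem
W = evecs[:, ::-1][:, :2]                      # top-2 discriminant directions
print(W.shape)
```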
Article
Given the exponential growth of videos published on the Internet, mechanisms for clustering, searching, and browsing large numbers of videos have become a major research area. More importantly, there is a demand for event detectors that go beyond simply finding objects and instead detect more abstract concepts, such as "feeding an animal" or a "wedding ceremony". This article presents an approach for event classification that enables searching for arbitrary events, including more abstract concepts, in found video collections based on analysis of the audio track. The approach does not rely on speech processing and is language-independent; instead, it generates models for a set of example query videos using a mixture of two types of audio features: Linear-Frequency Cepstral Coefficients and Modulation Spectrogram Features. The approach can be used to complement video analysis and requires no domain-specific tagging. Application of the approach to the TRECVid MED 2011 development set, which consists of more than 4000 random "wild" videos from the Internet, has shown a detection accuracy of 64%, including videos that do not contain an audio track.
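A rough sketch of the first feature type named above, linear-frequency cepstral coefficients: log magnitudes of a linear-frequency spectrogram followed by a DCT along the frequency axis. The frame/hop sizes and the 13-coefficient cut are common conventions assumed here, not the paper's exact settings.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import stft

def lfcc(signal, sr, n_coeff=13):
    """Linear-frequency cepstral coefficients, one row per frame."""
    _, _, Z = stft(signal, fs=sr, nperseg=512, noverlap=256)
    log_mag = np.log(np.abs(Z) + 1e-10)          # (freq_bins, frames), linear axis
    return dct(log_mag, axis=0, norm="ortho")[:n_coeff].T

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)              # toy 1-second test tone
print(lfcc(audio, sr).shape)                     # (frames, 13)
```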