Conference Paper

COST292 experimental framework for TRECVID 2006


Abstract

In this paper we give an overview of the four TRECVID tasks submitted by COST292, a European network of institutions working on semantic multimodal analysis and retrieval of digital video media. First, we present a shot boundary (SB) detection method based on merging the results of two detectors using a confidence measure. The two SB detectors used here, one from the Technical University of Delft and one from LaBRI, University of Bordeaux 1, are presented, followed by a description of the merging algorithm. The high-level feature extraction task comprises three separate systems. The first system, developed by the National Technical University of Athens (NTUA), utilises a set of MPEG-7 low-level descriptors and Latent Semantic Analysis to detect the features. The second system, developed by Bilkent University, uses a Bayesian classifier trained with a “bag of subregions” for each keyframe. The third system, by the Middle East Technical University (METU), exploits textual information in the video using character recognition methodology. The system submitted to the search task is an interactive retrieval application developed by Queen Mary, University of London, the University of Zilina and ITI from Thessaloniki. It combines basic retrieval functionalities in various modalities (i.e. visual, audio, textual) with a user interface supporting the submission of queries using any combination of the available retrieval tools, and the accumulation of relevant retrieval results over all queries submitted by a single user during a specified time interval. Finally, the rushes task submission comprises a video summarisation and browsing system specifically designed to present rushes material intuitively and efficiently in a video production environment. This system is the result of the joint work of the University of …
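
As a rough illustration of how two detectors' outputs can be merged with a confidence measure, consider the following minimal sketch; the function, tolerance and acceptance threshold are illustrative assumptions, not the algorithm from the paper.

```python
# Illustrative sketch: merging two shot-boundary detector outputs via
# confidence-weighted agreement. Names, the matching tolerance and the
# acceptance threshold are assumptions, not taken from the paper.
def merge_detections(dets_a, dets_b, tolerance=5, accept=0.6):
    """dets_a, dets_b: lists of (frame_index, confidence) pairs."""
    merged = []
    for frame_a, conf_a in dets_a:
        # look for a detection from the second detector within `tolerance` frames
        match = next((c for f, c in dets_b if abs(f - frame_a) <= tolerance), None)
        if match is not None:
            merged.append((frame_a, (conf_a + match) / 2))  # both agree: average
        elif conf_a >= accept:
            merged.append((frame_a, conf_a))  # single high-confidence detection
    for frame_b, conf_b in dets_b:
        if all(abs(frame_b - f) > tolerance for f, _ in dets_a) and conf_b >= accept:
            merged.append((frame_b, conf_b))
    return sorted(merged)

print(merge_detections([(120, 0.9), (300, 0.4)], [(122, 0.8), (450, 0.7)]))
```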


... Our aim is to find an efficient way to present a preview of rushes in the form of a video summary. Some parts of the proposed method have already been exploited in the COST292 approach [1] submitted to the TRECVID 2008 campaign on rushes summarisation [2]. According to TRECVID, the final summary should contain only the relevant parts, where undesirable content has been removed and only one take of each scene is shown. ...
... Several techniques have been proposed to deal with rushes summarisation ([3], [4], [5], and [1]). Some approaches compute the informativeness of each segment and accelerate the playback if the information is low [5]. ...
... The framework proposed to create an essential preview of rushes is shown in Figure 2. Shot detection and the removal of junk frames are performed automatically as described in [1]. This work focuses on the key-frame extraction, the clustering algorithm and the selection of the most representative shots (i.e., the blocks highlighted in Figure 2). ...
Conference Paper
Full-text available
This paper focuses on a specific type of unedited video content, called rushes, which are used for movie editing and usually present a high-level of redundancy. Our goal is to automatically extract a summarized preview, where redundant material is diminished without discarding any important event. To achieve this, rushes content has been first analysed and modeled. Then different clustering techniques on shot key-frames are presented and compared in order to choose the best representative segments to enter the preview. Experiments performed on TRECVID data are evaluated by computing the mutual information between the obtained results and a manually annotated ground-truth.
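
The clustering step described above can be pictured with a toy sketch: cluster key-frame descriptors and keep the frame nearest each centre as a candidate for the preview. The features, k and plain k-means are assumptions for illustration.

```python
import numpy as np

# Toy sketch of the clustering step: group key-frame descriptors with plain
# k-means and keep the frame nearest each cluster centre as a representative
# candidate for the preview. Features and k are illustrative assumptions.
rng = np.random.default_rng(0)
features = rng.random((40, 64))            # 40 key-frames x 64-dim descriptors
k = 5

centres = features[rng.choice(len(features), k, replace=False)]
for _ in range(20):                        # plain k-means iterations
    labels = np.argmin(((features[:, None] - centres) ** 2).sum(-1), axis=1)
    centres = np.array([features[labels == j].mean(0) if np.any(labels == j)
                        else centres[j] for j in range(k)])

# one representative key-frame per cluster
reps = [int(np.argmin(((features - c) ** 2).sum(-1))) for c in centres]
print("representative key-frames:", sorted(set(reps)))
```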
... Video search in academia, mainly under the TRECVID dataset [26][27][28][29][30][31][32][35], moves one step ahead, taking content analysis, especially semantic concept annotation, into consideration. Machine learning techniques are applied to convert visual information into textual description to realize content-based video search. ...
Article
Full-text available
Existing video search engines have not taken advantage of video content analysis and semantic understanding. Video search in academia uses semantic annotation to approach content-based indexing. We argue this is a promising direction to enable real content-based video search. However, due to the complexity of both video data and semantic concepts, existing techniques for automatic video annotation are still not able to handle large-scale video sets and large-scale concept sets, in terms of both annotation accuracy and computation cost. To address this problem, in this paper we propose a scalable framework for annotation-based video search, as well as a novel approach to enable large-scale semantic concept annotation, that is, online multi-label active learning. This framework is scalable in both the video sample dimension and the concept label dimension. Large-scale unlabeled video samples are assumed to arrive consecutively in batches with an initial pre-labeled training set, based on which a preliminary multi-label classifier is built. For each arriving batch, a multi-label active learning engine is applied, which automatically selects and manually annotates a set of unlabeled sample-label pairs. Then an online learner updates the original classifier by taking the newly labeled sample-label pairs into consideration. This process repeats until all data have arrived. During the process, new labels, even those without any pre-labeled training samples, can be incorporated at any time. Experiments on the TRECVID dataset demonstrate the effectiveness and efficiency of the proposed framework.
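
A minimal sketch of the sample-label selection idea, assuming a simple closest-to-0.5 uncertainty rule as a stand-in for the paper's selection strategy:

```python
import numpy as np

# Hedged sketch of sample-label pair selection: pick the unlabeled
# (sample, label) pairs whose predicted probability is closest to 0.5,
# i.e. the most uncertain ones. The scoring rule is an illustrative
# stand-in for the paper's strategy, not its actual criterion.
rng = np.random.default_rng(1)
probs = rng.random((6, 4))              # 6 samples x 4 concept labels
uncertainty = -np.abs(probs - 0.5)      # higher = closer to the boundary

budget = 5                              # pairs to send for manual annotation
flat = np.argsort(uncertainty, axis=None)[-budget:]
pairs = [divmod(int(i), probs.shape[1]) for i in flat]
print("annotate (sample, label) pairs:", pairs)
```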
... In the following subsections, we present our approach and implementation, in order to tackle the high-level concept detection problem from a different and at the same time innovative aspect, i.e., the one based on a region thesaurus and corresponding region types [69]. This research effort was expanded and further strengthened within [52], [53], [70], [72], and [73], by exploiting visual context in the process and by achieving promising research results. Our main focus remains to provide an ad-hoc "ontological" knowledge representation containing both high-level features (i.e., high-level concepts) and lower-level features and exploit it towards efficient multimedia analysis. ...
Article
Full-text available
In this paper we investigate detection of high-level concepts in multimedia content through an integrated approach of visual thesaurus analysis and visual context. In the former, detection is based on model vectors that represent image composition in terms of region types, obtained through clustering over a large data set. The latter deals with two aspects, namely high-level concepts and region types of the thesaurus, employing a model of a priori specified semantic relations among concepts and automatically extracted topological relations among region types; thus it combines both conceptual and topological context. A set of algorithms is presented, which modify either the confidence values of detected concepts, or the model vectors based on which detection is performed. Visual context exploitation is evaluated on TRECVID and Corel data sets and compared to a number of related visual thesaurus approaches.
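
The model-vector idea can be sketched as follows, assuming a toy thesaurus of region types obtained by clustering; the sizes and the similarity rule are illustrative.

```python
import numpy as np

# Minimal sketch of a "model vector": describe an image by how strongly its
# segmented regions match each region type in a visual thesaurus. Here the
# thesaurus is just a matrix of region-type centres; sizes are toy values.
rng = np.random.default_rng(2)
thesaurus = rng.random((10, 32))         # 10 region types x 32-dim descriptors
image_regions = rng.random((4, 32))      # 4 regions segmented from one image

# distance of each region to each region type
d = np.linalg.norm(image_regions[:, None] - thesaurus[None], axis=-1)
# model vector: per region type, similarity of the best-matching region
model_vector = 1.0 / (1.0 + d.min(axis=0))
print(model_vector.round(3))             # input to the concept detectors
```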
... Therefore, this year our group has submitted results to four tasks: high-level feature extraction, search, rushes and copy detection. Based on our submissions to TRECVID 2006 and 2007, we have tried to improve and enrich our algorithms and systems according to previous experience [2], [3]. The following sections give details of the applied algorithms and their evaluations. ...
Conference Paper
Full-text available
In this paper, we give an overview of the four tasks submitted to TRECVID 2008 by COST292. The high-level feature extraction framework comprises four systems. The first system transforms a set of low-level descriptors into the semantic space using Latent Semantic Analysis and utilises neural networks for feature detection. The second system uses a multi-modal classifier based on SVMs and several descriptors. The third system uses three image classifiers based on ant colony optimisation, particle swarm optimisation and a multi-objective learning algorithm. The fourth system uses a Gaussian model for singing detection and a person detection algorithm. The search task is based on an interactive retrieval application combining retrieval functionalities in various modalities with a user interface supporting automatic and interactive search over all queries submitted. The rushes task submission is based on a spectral clustering approach for removing similar scenes based on the eigenvalues of a frame similarity matrix, and a redundancy removal strategy which depends on the extraction of semantic features such as camera motion and faces. Finally, the submission to the copy detection task is conducted by two different systems. The first system consists of a video module and an audio module. The second system is based on mid-level features that are related to the temporal structure of videos.
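
The spectral-clustering idea behind the rushes submission can be illustrated roughly as below; the similarity function, matrix size and choice of eigenvectors are assumptions, not the submitted system.

```python
import numpy as np

# Rough sketch of spectral clustering on a frame-similarity matrix: frames
# whose leading-eigenvector embeddings nearly coincide are redundancy
# candidates. The similarity function and sizes are toy assumptions.
rng = np.random.default_rng(3)
X = rng.random((30, 16))                                    # 30 frames, toy features
S = np.exp(-np.linalg.norm(X[:, None] - X[None], axis=-1))  # similarity matrix

L = np.diag(S.sum(axis=1)) - S              # unnormalised graph Laplacian
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1:4]                    # leading non-trivial eigenvectors

print("spectral embedding of first 5 frames:\n", embedding[:5].round(3))
```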
... Being one of the major evaluation activities in the area, TRECVID [1] has always been a target initiative for ITI-CERTH. In the past, ITI-CERTH participated in the search task under the research network COST292 (TRECVID 2006, 2007 and 2008 [2], [3], [4]) and in the semantic indexing (SIN) task (which is similar to the old high-level feature extraction task) under the MESH integrated project [5] (TRECVID 2008 [6]) and the K-SPACE project [7] (TRECVID 2007 and 2008 [8], [9]). Recently, ITI-CERTH participated as a standalone organization in the search and high-level feature tasks of TRECVID 2009 [10]. ...
Conference Paper
Full-text available
This paper provides an overview of the tasks submitted to TRECVID 2010 by ITI-CERTH. ITI-CERTH participated in the Known-item search (KIS) and Instance search (INS) tasks, as well as in the Semantic Indexing (SIN) and the Event Detection in Internet Multimedia (MED) tasks. In the SIN task, techniques are developed which combine motion information with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation. In the MED task, trained concept detectors are used to represent video sources with model vector sequences, while a dimensionality reduction method is used to derive a discriminant subspace for recognizing events. The KIS and INS search tasks are performed by employing VERGE, which is an interactive retrieval application combining retrieval functionalities in various modalities (i.e. textual, visual and concept search). Evaluation results on the submitted runs for the aforementioned tasks provide interesting conclusions regarding the performance of the involved techniques and algorithms.
... Change detection and segmentation are the first steps of many signal processing applications (see, e.g., speech processing [1][2][3][4], video tracking [5], ergonomics [6], biomedical applications [7][8][9], seismic applications [10]). Most detection and segmentation algorithms are based on the theory of statistical detection and hypothesis testing [10][11][12]. ...
Article
Full-text available
CUSUM (cumulative sum) is a well-known method that can be used to detect changes in a signal when the parameters of this signal are known. This paper presents an adaptation of CUSUM-based change detection algorithms to long-term signal recordings where the various hypotheses contained in the signal are unknown. The starting point of the work was the dynamic cumulative sum (DCS) algorithm, previously developed for application to long-term electromyography (EMG) recordings. DCS has been improved in two ways. The first was a new procedure to estimate the distribution parameters, ensuring that the detectability property is respected. The second was the definition of two separate, automatically determined thresholds: one of them (the lower threshold) acted to stop the estimation process; the other (the upper threshold) was applied to the detection function. The automatic determination of the thresholds was based on the Kullback-Leibler distance, which gives information about the distance between the detected segments (events). Tests on simulated data demonstrated the efficiency of these improvements of the DCS algorithm.
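
For illustration, a minimal one-sided CUSUM on a toy signal with known pre- and post-change parameters; the DCS refinements described in the abstract (adaptive estimation, the two automatic thresholds) are not reproduced.

```python
import numpy as np

# Classical one-sided CUSUM on a toy Gaussian signal with a known mean
# shift. Parameters are assumed known here, which is exactly the setting
# the paper's DCS extension relaxes.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.5, 1.0, 200)])

mu0, mu1, sigma2 = 0.0, 1.5, 1.0
llr = (mu1 - mu0) / sigma2 * (x - (mu0 + mu1) / 2)   # log-likelihood ratio

g, h = 0.0, 10.0                                     # decision function, threshold
for n, step in enumerate(llr):
    g = max(0.0, g + step)                           # CUSUM recursion
    if g > h:
        print("change detected near sample", n)
        break
```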
Conference Paper
Full-text available
This paper describes the ITI interactive video retrieval system.
Article
Full-text available
This paper presents the Shot Boundary Detection system developed by LaBRI in the context of the "Rough Indexing" paradigm. We work on compressed streams and use only I- and P-frame information (DC coefficients of I-frames, motion vectors of P-frames and DC coefficients of the prediction error), which allows us to be faster than many equivalent systems (10 times faster than real time on the TRECVID 2003 test set, and 3 times faster on the 2004 set, whose MPEG files are composed of only I and P frames). In this context the application was not developed to classify shot change transition effects; the initial goal was to allow real-time, intelligent browsing of video content for common users. The detection is performed in two stages: robust global camera motion estimation, followed by detection of P-frame peaks (computed from motion and frame statistics) and of I-frame boundaries (measuring similarity between successive compensated I-frames). As we work with two types of frames (I and P), we associate two statistical models, which give us two sets of ratios and thresholds to calibrate the detector. LaBRI's first TRECVID participation required an evolution of the application to distinguish transition effects, which introduced two new thresholds to calibrate. We generally obtain equivalent values of Recall and Precision (0.72 on the TRECVID 2003 test set). On the TRECVID 2004 test set our best runs obtain ri-3: 0.723 (Recall) and 0.606 (Precision); and ri-4: 0.703 (Recall) and 0.635 (Precision).
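
The P-frame peak stage can be pictured with a toy ratio test on a per-frame statistic; the window and ratio below are assumptions, not the paper's calibrated statistical models.

```python
import numpy as np

# Illustrative peak test in the spirit of the P-frame stage: flag a cut when
# a frame statistic (e.g. prediction-error energy) spikes well above its
# local context. Window size and ratio are assumptions for illustration.
rng = np.random.default_rng(5)
stat = rng.random(300) * 0.2
stat[120] = stat[240] = 1.0                 # injected cuts

window, ratio = 10, 4.0
for n in range(window, len(stat) - window):
    context = np.concatenate([stat[n - window:n], stat[n + 1:n + 1 + window]])
    if stat[n] > ratio * context.mean():
        print("cut candidate at P-frame", n)
```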
Article
Full-text available
This paper presents our camera motion detection method (pan, tilt and zoom) for TRECVID 2005. As input data, we only extract P-frame motion compensation vectors directly from the MPEG compressed stream, and we thus achieve processing 3-4 times faster than real time. Our method is based on global camera motion estimation and a likelihood-based significance test of the camera parameters. The best run (RI-3) on the TRECVID 2005 test set provides 0.912 for mean precision and 0.737 for mean recall.
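
A sketch of the global-motion estimation step, assuming a simple pan/tilt/zoom model fitted to motion vectors by least squares; the likelihood-based significance test is omitted.

```python
import numpy as np

# Sketch of fitting a simple global motion model (pan tx, tilt ty, zoom z)
# to macroblock motion vectors by least squares. The model and noise level
# are illustrative; the paper's significance test is not reproduced.
rng = np.random.default_rng(6)
xy = rng.uniform(-1, 1, (100, 2))                    # block positions (centred)
tx, ty, z = 2.0, -1.0, 0.05                          # ground-truth motion
mv = np.column_stack([tx + z * xy[:, 0], ty + z * xy[:, 1]])
mv += rng.normal(0, 0.05, mv.shape)                  # noisy MPEG motion vectors

# stack the u and v equations: u = tx + z*x, v = ty + z*y
A = np.zeros((200, 3))
A[:100, 0] = 1; A[:100, 2] = xy[:, 0]                # u rows
A[100:, 1] = 1; A[100:, 2] = xy[:, 1]                # v rows
b = np.concatenate([mv[:, 0], mv[:, 1]])
params, *_ = np.linalg.lstsq(A, b, rcond=None)
print("pan, tilt, zoom ~", params.round(3))
```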
Conference Paper
Full-text available
This paper presents a system designed for the management of multimedia databases that embarks upon the problem of efficient media processing and representation for automatic semantic classification and modelling. Its objectives are founded on the integration of a large-scale wildlife digital media archive with manually annotated semantic metadata organised in a structured taxonomy and media classification system. Novel techniques will be applied to temporal analysis, intelligent key-frame extraction, animal gait analysis, semantic modelling and audio classification. The system demonstrator will be developed as part of the ICBR project within the 3C Research programme of convergent technology research for digital media processing and communications.
Article
Full-text available
This paper describes a proposed algorithm for speech/music discrimination, which works on data directly taken from MPEG encoded bitstream thus avoiding the computationally difficult decoding-encoding process. The method is based on thresholding of features derived from the modulation envelope of the frequency-limited audio signal. The discriminator is tested on more than 2 hours of audio data, which contain clean and noisy speech from several speakers and a variety of music content. The discriminator is able to work in real time and despite its simplicity, results are very promising.
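
The principle can be demonstrated on synthetic audio: speech-like signals show strong slow modulation of the energy envelope, steady music-like tones do not. The frame size and threshold below are illustrative, and the sketch works on raw samples rather than the MPEG domain used in the paper.

```python
import numpy as np

# Toy version of the idea: speech carries strong syllabic (slow) modulation
# in its energy envelope, sustained music usually does not, so a threshold
# on envelope variation can separate them. All parameters are assumptions.
def envelope_variation(signal, sr, frame=0.02):
    n = int(sr * frame)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    env = np.sqrt((frames ** 2).mean(axis=1))        # short-time energy envelope
    return env.std() / (env.mean() + 1e-9)           # relative modulation depth

sr = 8000
t = np.arange(sr * 2) / sr
speechy = np.sin(2 * np.pi * 440 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
musicy = np.sin(2 * np.pi * 440 * t)                 # steady tone, flat envelope

for name, sig in [("speech-like", speechy), ("music-like", musicy)]:
    v = envelope_variation(sig, sr)
    print(name, "->", "speech" if v > 0.3 else "music", round(v, 2))
```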
Conference Paper
Full-text available
Low-level features are becoming insufficient to build efficient content-based retrieval systems. Users are no longer interested in retrieving visually similar content; they expect retrieval systems to find documents with similar semantic content. Bridging the gap between low-level features and semantic content is a challenging task necessary for future retrieval systems. Latent semantic indexing (LSI) was successfully introduced to efficiently index text documents. We propose to adapt this technique to efficiently represent the visual content of video shots for semantic content detection. Although we restrict our approach to visual features, it can be extended with minor changes to audio and motion features to build a multi-modal system. The semantic content is then detected using two classifiers: k-nearest neighbours and neural network classifiers. Finally, in the experimental section we show the performance of each classifier and the performance gain obtained with LSI features compared to traditional features.
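
A compact sketch of the proposed pipeline, assuming toy data: project feature vectors into a latent space with a truncated SVD (LSI), then classify with k-nearest neighbours.

```python
import numpy as np

# Sketch of the pipeline: project shot feature vectors into a latent space
# via truncated SVD (LSI), then classify with k-nearest neighbours. Data,
# dimensions and k are toy assumptions.
rng = np.random.default_rng(7)
X = rng.random((60, 100))                 # 60 shots x 100 visual features
y = rng.integers(0, 2, 60)                # toy semantic labels

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:10].T                         # 10-dim latent representation

def knn_predict(z, Z, y, k=5):
    idx = np.argsort(np.linalg.norm(Z - z, axis=1))[:k]
    return int(np.bincount(y[idx]).argmax())

print("predicted class for shot 0:", knn_predict(Z[0], Z, y))
```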
Conference Paper
Full-text available
A novel approach to speech-music discrimination based on rhythm (or beat) detection is introduced. Rhythmic pulses are detected by applying a long-term autocorrelation method on band-passed signals. This approach is combined with another, in which the features describe the energy peaks of the signal. The discriminator uses just three features that are computed from data directly taken from an MPEG-1 bitstream. The discriminator was tested on more than 3 hours of audio data. Average recognition rate is 97.7%.
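
A toy version of the long-term autocorrelation idea: a periodic pulse train in the energy envelope yields an autocorrelation peak at the beat lag.

```python
import numpy as np

# Toy illustration of the long-term autocorrelation idea: a periodic beat in
# the energy envelope produces a strong autocorrelation peak at the beat lag.
# Envelope rate, pulse spacing and the search range are assumptions.
sr = 100                                   # envelope sampling rate (Hz)
env = np.random.default_rng(8).random(10 * sr) * 0.2
env[::50] += 1.0                           # a pulse every 0.5 s (120 BPM)

env = env - env.mean()
ac = np.correlate(env, env, mode="full")[len(env) - 1:]
lag = np.argmax(ac[20:200]) + 20           # search lags of 0.2-2.0 s
print("beat period ~", lag / sr, "s ->", round(60 * sr / lag), "BPM")
```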
Article
Full-text available
This paper looks into a new direction in video content analysis - the representation and modeling of affective video content. The affective content of a given video clip can be defined as the intensity and type of feeling or emotion (both are referred to as affect) that are expected to arise in the user while watching that clip. The availability of methodologies for automatically extracting this type of video content will extend the current scope of possibilities for video indexing and retrieval. For instance, we will be able to search for the funniest or the most thrilling parts of a movie, or the most exciting events of a sport program. Furthermore, as the user may want to select a movie not only based on its genre, cast, director and story content, but also on its prevailing mood, the affective content analysis is also likely to contribute to enhancing the quality of personalizing the video delivery to the user. We propose in this paper a computational framework for affective video content representation and modeling. This framework is based on the dimensional approach to affect that is known from the field of psychophysiology. According to this approach, the affective video content can be represented as a set of points in the two-dimensional (2-D) emotion space that is characterized by the dimensions of arousal (intensity of affect) and valence (type of affect). We map the affective video content onto the 2-D emotion space by using the models that link the arousal and valence dimensions to low-level features extracted from video data. This results in the arousal and valence time curves that, either considered separately or combined into the so-called affect curve, are introduced as reliable representations of expected transitions from one feeling to another along a video, as perceived by a viewer.
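
A minimal sketch of an arousal time curve in the spirit of this framework, assuming motion activity and audio energy as the low-level features and a moving-average smoother; the weights and window are illustrative, not the paper's models.

```python
import numpy as np

# Minimal sketch: combine smoothed low-level features into one arousal
# curve, per the dimensional-affect idea. Features, weights and smoothing
# window are illustrative assumptions, not the paper's models.
rng = np.random.default_rng(9)
motion = rng.random(500)                  # per-frame motion activity
energy = rng.random(500)                  # per-frame audio energy

def smooth(x, w=25):                      # moving average (viewer memory effect)
    return np.convolve(x, np.ones(w) / w, mode="same")

arousal = 0.6 * smooth(motion) + 0.4 * smooth(energy)
print("most exciting segment around frame", int(np.argmax(arousal)))
```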
Article
Full-text available
This paper presents an overview of color and texture descriptors that have been approved for the Final Committee Draft of the MPEG-7 standard. The color and texture descriptors that are described in this paper have undergone extensive evaluation and development during the past two years. Evaluation criteria include effectiveness of the descriptors in similarity retrieval, as well as extraction, storage, and representation complexities. The color descriptors in the standard include a histogram descriptor that is coded using the Haar transform, a color structure histogram, a dominant color descriptor, and a color layout descriptor. The three texture descriptors include one that characterizes homogeneous texture regions and another that represents the local edge distribution. A compact descriptor that facilitates texture browsing is also defined. Each of the descriptors is explained in detail by their semantics, extraction and usage. The effectiveness is documented by experimental results.
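
The scalable colour idea, a histogram coded with a Haar transform, can be sketched as follows; this shows the principle only and is not the normative MPEG-7 extraction.

```python
import numpy as np

# Illustrative sketch of the Haar-coded histogram idea: decompose a colour
# histogram so coarse coefficients give a compact, scalable descriptor.
# Bin count and the coefficient cut-off are assumptions; this is not the
# normative MPEG-7 extraction procedure.
rng = np.random.default_rng(10)
hist = rng.random(256)                     # e.g. a 256-bin HSV histogram

coeffs = hist.copy()
n = len(coeffs)
while n > 1:                               # in-place 1-D Haar decomposition
    half = n // 2
    a = (coeffs[:n:2] + coeffs[1:n:2]) / 2 # averages
    d = (coeffs[:n:2] - coeffs[1:n:2]) / 2 # details
    coeffs[:half], coeffs[half:n] = a, d
    n = half

print("compact descriptor (8 coarse coefficients):", coeffs[:8].round(3))
```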
Article
Full-text available
This technical note describes straightforward techniques for document indexing and retrieval that have been solidly established through extensive testing and are easy to apply. They are useful for many different types of text material, are viable for very large files, and have the advantage that they do not require special skills or training for searching, but are easy for end users. The document and text retrieval methods described here have a sound theoretical basis, are well established by extensive testing, and the ideas involved are now implemented in some commercial retrieval systems. Testing in the last few years has, in particular, shown that the methods presented here work very well with full texts, not only title and abstracts, and with large files of texts containing three quarters of a million documents. These tests, the TREC Tests (see Harman 1993 - 1997; IP&M 1995), have been rigorous comparative evaluations involving many different approaches to information retrieval. ...
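
The established weighting the note refers to can be illustrated with bare-bones tf-idf ranking; the tokenisation and weighting variant are simplified assumptions.

```python
import math

# Bare-bones tf-idf ranking in the spirit of the note; the tokenisation and
# weighting variant are simplified assumptions, and the corpus is a toy.
docs = ["shot boundary detection in video",
        "semantic video retrieval and search",
        "speech and music discrimination"]
query = "video retrieval"

N = len(docs)
def score(q, d):
    words = d.split()
    s = 0.0
    for t in q.split():
        tf = words.count(t)                            # term frequency
        df = sum(t in doc.split() for doc in docs)     # document frequency
        if tf and df:
            s += tf * math.log(N / df)                 # tf * idf
    return s

ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print("best match:", ranked[0])
```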
Article
Abstract, In order to represent large amount of infor-mation in form of a video key-frame summary, this paper studies narrative grammar of a comic strip genre and using its natural and inherent sematic rules, lays out summaries in an efficient and user centered way. In addition, a robust real-time algorithm for key-frame extraction is described and the evaluation results are presented.
Article
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
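
A small worked example of the mechanics: SVD of a toy term-document matrix, with the query folded in as a pseudo-document and matched by cosine similarity.

```python
import numpy as np

# Worked example of the LSI mechanics: SVD of a toy term-document matrix,
# a query folded in as a pseudo-document, and cosine matching in the latent
# space. The matrix and rank k are toy assumptions.
A = np.array([[2, 0, 1, 0],                 # term x document counts
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 1, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs = (np.diag(s[:k]) @ Vt[:k]).T          # documents in the latent space

q = np.array([1, 1, 0, 0], dtype=float)     # query uses terms 0 and 1
q_hat = q @ U[:, :k]                        # fold the query in

cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print("document ranking (best first):", np.argsort(-cos))
```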
Conference Paper
An approach for content-based image retrieval with relevance feedback based on a structured multi-feature space is proposed. It uses a novel kernel for merging multiple feature subspaces into a complementary space. The kernel exploits the nature of the data by assigning appropriate weights to each feature set. The weights are dynamically adapted to user preferences in a relevance feedback scenario.
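
The merging idea can be sketched as a weighted sum of per-feature kernels, which remains a valid kernel; the fixed weights below stand in for the feedback-adapted ones.

```python
import numpy as np

# Sketch of merging feature subspaces with a weighted sum of per-feature
# kernels; a convex combination of valid kernels is itself a valid kernel.
# The fixed weights stand in for the feedback-adapted ones in the paper.
def rbf(X, Y, gamma=1.0):
    d = ((X[:, None] - Y[None]) ** 2).sum(-1)
    return np.exp(-gamma * d)

rng = np.random.default_rng(11)
color = rng.random((5, 8))                 # colour features of 5 images
texture = rng.random((5, 16))              # texture features of the same images

w_color, w_texture = 0.7, 0.3              # adapted from feedback in practice
K = w_color * rbf(color, color) + w_texture * rbf(texture, texture)
print(K.round(2))
```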
Conference Paper
Low-level video analysis is an important step for further semantic interpretation of the video. This provides information about the camera work, video editing process, shape, texture, color and topology of the objects and the scenes captured by the camera. Here we introduce a framework capable of extracting the information about the shot boundaries and the camera and object motion, based on the analysis of spatiotemporal pixel blocks in a series of video frames. Extracting the motion information and detecting shot boundaries using the same underlying principle is the main contribution of this paper. Besides, this original principle is likely to improve the robustness of the abovementioned low-level video analysis, as it avoids typical problems of standard frame-based approaches, and the camera motion information provides critical help in improving shot boundary detection performance. The system is evaluated using TRECVID data [1] with promising results.
Article
In this paper, the two different applications based on the Schema Reference System that were developed by the SCHEMA NoE for participation in the search task of TRECVID 2004 are illustrated. The first application, named "Schema-Text", is an interactive retrieval application that employs only textual information, while the second one, named "Schema-XM", is an extension of the former, employing algorithms and methods for combining textual, visual and higher level information. Two runs for each application were submitted, I A 2 SCHEMA-Text 3, I A 2 SCHEMA-Text 4 for Schema-Text and I A 2 SCHEMA-XM 1, I A 2 SCHEMA-XM 2 for Schema-XM. The comparison of these two applications in terms of retrieval efficiency revealed that the combination of information from different data sources can provide higher efficiency for retrieval systems. Experimental testing additionally revealed that initially performing a text-based query and subsequently proceeding with visual similarity search, using one of the returned relevant keyframes as an example image, is a good scheme for combining visual and textual information.
Article
A new kernel for structured multi-feature spaces is introduced. It exploits the diversity of information encapsulated in different features. The mathematical validity of the introduced kernel is proven in the context of a conventional convex optimisation problem for support vector machines. Computer simulations show high performance for classification of images.
Article
Content-based image retrieval (CBIR) has become one of the most active research areas in the past few years. Many visual feature representations have been explored and many systems built. While these research efforts establish the basis of CBIR, the usefulness of the proposed approaches is limited. Specifically, these efforts have relatively ignored two distinct characteristics of CBIR systems: (1) the gap between high-level concepts and low-level features, and (2) the subjectivity of human perception of visual content. This paper proposes a relevance feedback based interactive retrieval approach, which effectively takes into account the above two characteristics in CBIR. During the retrieval process, the user's high-level query and perception subjectivity are captured by dynamically updated weights based on the user's feedback. The experimental results over more than 70000 images show that the proposed approach greatly reduces the user's effort of composing a query, and captures the user's information need more precisely.
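
One classic variant of the reweighting idea, stated here as an assumption rather than the paper's exact rule: features that vary little across the user-marked relevant images receive larger weights in the distance.

```python
import numpy as np

# Minimal sketch of relevance-feedback reweighting: features that are
# consistent across the images the user marked relevant get higher weights.
# The inverse-std rule is one classic variant, used here as an assumption.
rng = np.random.default_rng(12)
relevant = rng.random((8, 5))              # feedback: 8 relevant images, 5 features

w = 1.0 / (relevant.std(axis=0) + 1e-6)    # consistent feature -> large weight
w /= w.sum()

query = relevant.mean(axis=0)              # move the query toward the feedback
candidate = rng.random(5)
dist = np.sqrt((w * (query - candidate) ** 2).sum())
print("weights:", w.round(3), "weighted distance:", round(float(dist), 3))
```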
Article
A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
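
The central identity is easy to verify numerically: any positive-definite kernel k induces the feature-space distance ||phi(x) - phi(y)||^2 = k(x,x) - 2k(x,y) + k(y,y).

```python
import numpy as np

# Any positive-definite kernel induces a distance in feature space:
#   ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y)
# so distance-based algorithms can run in feature space without ever
# computing phi explicitly. Demonstrated here with an RBF kernel.
def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
d2 = rbf(x, x) - 2 * rbf(x, y) + rbf(y, y)
print("feature-space distance:", round(float(np.sqrt(d2)), 4))
```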
V. Mezaris, H. Doulaverakis, S. Herrmann, B. Lehane, N. O'Connor, I. Kompatsiaris, and M. Strintzis. Combining textual and visual information processing for interactive video retrieval: Schema's participation in TRECVID 2004. In TRECVID 2004 - Text REtrieval Conference TRECVID Workshop, MD, USA, 2004. National Institute of Standards and Technology.
R. Jarina, N. Murphy, N. O'Connor, and S. Marlow. Speech-music discrimination from MPEG-1 bitstream. In V. V. Kluev and N. E. Mastorakis (Eds.), Advances in Signal Processing, Robotics and Communications, pages 174-178. WSES Press, 2001. (SSIP'01 - WSES International Conference on Speech, Signal and Image Processing.)
J. Ćalić, N. Campbell, M. Mirmehdi, B. Thomas, R. Laborde, S. Porter, and N. Canagarajah. ICBR - multimedia management system for intelligent content based retrieval. In International Conference on Image and Video Retrieval (CIVR 2004), pages 601-609. Springer LNCS 3115, July 2004.
J.Ćalić and N. Campbell. Comic-like layout of video summaries. In Proc. of the 7th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2006), 2006.