ITI Interactive Video Retrieval System
Stefanos Vrochidis, Vasileios Mezaris and Ioannis Kompatsiaris
Informatics and Telematics Institute / Centre for Research and Technology Hellas
6th Km Charilaou-Thermi Road, 57001 Thermi-Thessaloniki, Greece
{stefanos, bmezaris, ikom}
ABSTRACT
This paper describes the ITI interactive video retrieval system.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models.
General Terms
Algorithms, Design.
Keywords
Search engine, retrieval, visual, hybrid, video, MPEG-7
The search engine implemented by ITI is based on an interactive
retrieval model for handling video resources; it integrates different
search modules and supports hybrid functionality by combining
information from visual and textual queries.
In general, the developed application is a hybrid interactive
retrieval system that combines basic retrieval functionalities with a
user-friendly interface supporting the submission of queries and
the accumulation of relevant retrieval results.
The following basic retrieval modules are integrated in the
developed search application and are combined in order to
provide the hybrid functionality:
• Visual similarity search module
• Textual information processing module
The search application is built on web technologies, specifically
PHP, JavaScript and a MySQL database, and provides a GUI for
performing retrieval experiments over the Internet (Fig. 1). Using
this GUI, the user can employ any combination of the supported
retrieval functionalities to submit a query and view the retrieval
results ordered by rank. Starting with a textual query, the user can
take advantage of the hybrid functionality by subsequently
submitting a visual similarity search based on the initial textual
results, obtaining a set of results that combines textual and visual
information. Desired video shots can be stored in a structure that
mimics the shopping cart found in electronic commerce sites. In
this way, the user can repeat the search with different queries
without losing relevant shots retrieved by previous queries
submitted by the same user within the allowed time interval.
Figure 1. GUI of ITI Video Retrieval System
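The text-then-visual hybrid flow described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the index structures, shot identifiers and scores are hypothetical placeholders, and the seed shot is simply taken to be the top textual hit rather than a user selection.

```python
def hybrid_search(text_scores, visual_sim, top_k=5):
    """Re-rank shots by visual similarity to a top textual result.

    text_scores: {shot_id: textual relevance score}   (hypothetical)
    visual_sim:  {shot_id: {shot_id: visual similarity}} (hypothetical)
    """
    # Step 1: rank shots by the textual query's scores
    text_rank = sorted(text_scores, key=text_scores.get, reverse=True)
    # Step 2: seed a visual-similarity query with the best textual hit
    seed = text_rank[0]
    sims = visual_sim[seed]
    # Step 3: return shots ordered by visual similarity to the seed
    return sorted(sims, key=sims.get, reverse=True)[:top_k]


# A shopping-cart-like store lets relevant shots accumulate across queries
cart = set()
cart.update(hybrid_search(
    {"shot1": 0.9, "shot2": 0.4, "shot3": 0.1},
    {"shot1": {"shot1": 1.0, "shot3": 0.8, "shot2": 0.2},
     "shot2": {}, "shot3": {}},
))
```

Because the cart is kept outside the per-query logic, repeating the search with a different query only adds to it, mirroring the time-limited session behavior described above.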
3.1 Visual Similarity Search
In the developed application, visual similarity search is realized
using MPEG-7 XM and its extensions.
The MPEG-7 XM supports two main functionalities: (a) extraction
of a standardized Descriptor for a collection of images, and (b)
retrieval of the images of a collection that are similar to a given
example, using the previously extracted standardized Descriptor
and a corresponding matching function. The employed extensions
to the MPEG-7 XM include the MultiImage module, which
effectively combines more than one MPEG-7 descriptor; the XM
Server, which keeps the originally command-line MPEG-7 XM
software running as a background process so that the binary
descriptor files need not be decoded anew for every query; and
the Indexing module, which employs an indexing structure to
speed up query execution.
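The descriptor-fusion idea behind the MultiImage module can be sketched as a weighted combination of per-descriptor distances. The L1 metric, the weights and the descriptor names below are illustrative assumptions; the actual XM applies each descriptor's own standardized matching function.

```python
def multi_descriptor_distance(query, item, weights):
    """Fuse distances from several descriptors into one score.

    query/item: {descriptor_name: feature vector (list of floats)}
    weights:    {descriptor_name: relative weight}
    All names and metrics here are illustrative, not the XM's own.
    """
    total = 0.0
    for name, w in weights.items():
        # L1 distance stands in for the descriptor's matching function
        d = sum(abs(a - b) for a, b in zip(query[name], item[name]))
        total += w * d
    return total
```

Items of the collection would then be ranked by this fused distance to the example image, with the weights balancing the influence of each descriptor.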
3.2 Textual Information Processing Module
Text query is based on the video shot audio information. The text
algorithm integrated in the search platform is the BM25
algorithm, which incorporates normalized document length, term
and collection frequency in order to produce matching score for
the document against the request.
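As an illustration, a minimal BM25 scorer over tokenized shot transcripts might look as follows. The tokenization, the parameter values k1 and b, and the function name are assumptions for the sketch, not the paper's implementation.

```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document against the query with BM25.

    docs: list of token lists (e.g., one per video shot transcript).
    Parameter defaults k1=1.2, b=0.75 are common choices, assumed here.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        dl = len(d)
        s = 0.0
        for t in query_terms:
            tf = d.count(t)                        # term frequency
            if tf == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # length-normalized term weight
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

Shots are then presented in decreasing order of this matching score.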
REFERENCES
[1] V. Mezaris, H. Doulaverakis, S. Herrmann, et al., Combining
Textual and Visual Information Processing for Interactive
Video Retrieval, TRECVID 2004 Workshop, Gaithersburg,
MD, USA, November 2004.
[2] J. Calic, P. Kramer, U. Naci, S. Vrochidis, et al., COST292
Experimental Framework for TRECVID 2006, 4th TRECVID
Workshop, Gaithersburg, USA, November 2006.
Copyright is held by the author/owner(s).
CIVR'07, July 9-11, 2007, Amsterdam, The Netherlands.
Copyright 2007 ACM 978-1-59593-733-9/07/0007.