VERGE: An Interactive Search Engine for Browsing
Video Collections
Anastasia Moumtzidou1, Konstantinos Avgerinakis1, Evlampios Apostolidis1, Vera
Aleksic2, Fotini Markatopoulou1, Christina Papagiannopoulou1, Stefanos Vrochidis1,
Vasileios Mezaris1, Reinhard Busch2, Ioannis Kompatsiaris1
1Information Technologies Institute/Centre for Research and Technology Hellas,
6th Km. Xarilaou - Thermi Road, 57001 Thermi-Thessaloniki, Greece
{moumtzid, koafgeri, apostolid, markatopoulou, cppapagi, stefanos,
bmezaris, ikom}
2Linguatec Sprachtechnologien GmbH, Gottfried-Keller-Str. 12, 81245 München
{v.aleksic, r.busch}
Abstract. This paper presents the VERGE interactive video retrieval engine, which is capable of searching and browsing video content. The system integrates several content-based analysis and retrieval modules, such as video shot segmentation and scene detection, concept detection, clustering and visual similarity search, into a user-friendly interface that supports the user in browsing through the collection in order to retrieve the desired clip.
1 Introduction
This paper describes the VERGE interactive video search engine1, which is capable of retrieving and browsing video by integrating different indexing and retrieval modules. VERGE supports the Known Item Search (KIS) task, which requires the incorporation of techniques for browsing and navigation within a video collection. VERGE was evaluated through participation in workshops and showcases such as TRECVID and VideOlympics, where it was shown to significantly improve the user search experience compared to single or fewer search modalities. Specifically, VERGE achieved the best results in the interactive Known Item Search task of TRECVID 2011 with a Mean Inverted Rank of 0.56, while the concept detectors of VERGE achieved a good balance between detection accuracy (e.g., 15.8% MXinfAP in TRECVID 2013) and low computational complexity.
The proposed version of VERGE aims at participating in the KIS task of the Video Browser Showdown (VBS) competition [1]. In this context, VERGE supports interactive search for a known video clip in a large video collection by incorporating content-based analysis and interactive retrieval techniques.
In the next sections we present the content-based analysis and retrieval techniques supported by VERGE, as well as the interaction with the user.
1 More information and demos of VERGE are available at:
2 Video Retrieval System
VERGE is a retrieval system that combines advanced retrieval functionalities with a user-friendly interface. The following basic modules are integrated: a) Shot and Scene Segmentation; b) Textual Information Processing; c) Visual Similarity Search; d) High-Level Concept Detection; e) Clustering.
The shot segmentation module detects shot boundaries by extracting visual features, namely color coherence, Macbeth color histogram and luminance center of gravity, and forming a corresponding feature vector per frame [2]. Then, given a pair of frames, the distances between their vectors are computed, composing distance vectors that are finally evaluated using one or more SVM classifiers, resulting in the detection of both abrupt and gradual transitions between the shots of the video.
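The pipeline above can be sketched as follows. This is a minimal illustration, not the actual VERGE implementation: the frame features are random placeholders for the real descriptors (color coherence, Macbeth histogram, luminance center of gravity), and the SVM is trained on synthetic "same shot" vs. "cut" distance vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def frame_features(n_frames, dim=32):
    """One feature vector per frame (placeholder for the real descriptors)."""
    return rng.random((n_frames, dim))

def distance_vectors(feats):
    """Element-wise absolute differences between consecutive frames."""
    return np.abs(np.diff(feats, axis=0))

# Toy training set: small distances -> same shot (0), large -> transition (1).
X_train = np.vstack([rng.random((50, 32)) * 0.1,
                     rng.random((50, 32)) * 0.9 + 0.1])
y_train = np.array([0] * 50 + [1] * 50)
clf = SVC(kernel="linear").fit(X_train, y_train)

feats = frame_features(100)
pred = clf.predict(distance_vectors(feats))  # 1 marks a shot boundary
boundaries = np.flatnonzero(pred) + 1        # frame index after each cut
```

In the real system the classifier is trained on annotated transitions and applied to distance vectors computed at several temporal offsets, which is what allows gradual as well as abrupt transitions to be caught.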
Scene segmentation builds on the previous analysis and groups shots into sets that correspond to individual scenes of the video. The algorithm [3] introduces and combines two extensions of the Scene Transition Graph (STG): the first reduces the computational cost by considering shot-linking transitivity, while the second constructs a probabilistic framework for combining multiple STGs.
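The shot-linking-transitivity idea can be illustrated with a much-simplified stand-in for the STG of [3]: link shots whose feature distance falls below a threshold within a temporal window, then take connected components as scenes, so that shots linked only through intermediates still end up grouped together. The thresholds and 1-D features here are illustrative only.

```python
import numpy as np

def group_shots(shot_feats, threshold=0.2, window=5):
    """Group shots into scenes via transitive similarity links
    (a simplified sketch of the Scene Transition Graph idea)."""
    n = len(shot_feats)
    parent = list(range(n))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, min(i + 1 + window, n)):
            if np.linalg.norm(shot_feats[i] - shot_feats[j]) < threshold:
                union(i, j)
    # Transitivity: shots linked through intermediates share one root.
    return [find(i) for i in range(n)]

shot_feats = [np.array([0.0]), np.array([0.05]), np.array([0.1]),
              np.array([5.0]), np.array([5.05])]
scene_ids = group_shots(shot_feats)   # first three shots share a scene
```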
The textual information processing module applies Automatic Speech Recognition (ASR) to the videos. We employ the VPE (Voice Pro Enterprise) framework, which is based on RWTH-ASR technology [4]. Finally, each shot is described by a set of words, which are used to create a taxonomy that facilitates browsing of the collection.
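Once each shot is described by a set of words, text-based lookup reduces to an inverted index from words to shots. The shot identifiers and transcripts below are hypothetical; this sketches the indexing step only, not the taxonomy construction.

```python
from collections import defaultdict

# Hypothetical ASR output: shot id -> recognised words for that shot.
asr = {
    "video1_shot3": ["weather", "forecast", "rain"],
    "video1_shot4": ["rain", "flood"],
    "video2_shot1": ["football", "goal"],
}

# Inverted index: word -> set of shots whose transcript contains it.
index = defaultdict(set)
for shot, words in asr.items():
    for w in words:
        index[w].add(shot)

rain_shots = index["rain"]  # shots that mention "rain"
```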
The visual similarity search module performs content-based retrieval based on global and local information. For global information, MPEG-7 descriptors are extracted from each keyframe and concatenated into a single feature vector; efficient retrieval is achieved by employing an R-tree indexing structure. For local information, SURF features are extracted, and two Bag-of-Visual-Words techniques are applied for representing and retrieving images efficiently: the first builds visual vocabularies via hierarchical k-means clustering [5], while the second uses k-means clustering and VLAD encoding to represent images.
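The VLAD step can be sketched as follows: for every visual word of a k-means vocabulary, sum the residuals of the local descriptors assigned to it, then L2-normalise the concatenation. The random arrays stand in for SURF descriptors; vocabulary size and dimensionality are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(local_descs, kmeans):
    """VLAD encoding: per-centroid residual sums, L2-normalised."""
    k, d = kmeans.cluster_centers_.shape
    assign = kmeans.predict(local_descs)
    enc = np.zeros((k, d))
    for i, c in enumerate(assign):
        enc[c] += local_descs[i] - kmeans.cluster_centers_[c]
    enc = enc.ravel()
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc

rng = np.random.default_rng(0)
train = rng.random((500, 64))                  # stand-in for SURF descriptors
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(train)
signature = vlad(rng.random((200, 64)), km)    # one vector per keyframe
```

Keyframes are then compared by the distance between their VLAD signatures, which gives a single fixed-length vector per image regardless of how many local descriptors it had.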
The high-level concept retrieval module indexes the video shots based on 346 high-level concepts (e.g., water, aircraft). For each keyframe we employ up to 25 feature extraction procedures [6]. For learning these concepts, a bag of Linear Support Vector Machines (LSVMs) is trained for each feature extraction procedure and each concept: a sampling strategy partitions the dataset into 5 subsets, and an LSVM is trained on each subset. During the classification phase, a new unlabeled video shot is given to the trained LSVMs, each of which returns a degree of confidence that the concept is depicted in the image, and late fusion is used to combine these scores.
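A minimal sketch of the bagging-plus-late-fusion scheme for one concept and one feature type, under the assumption that late fusion averages the decision values of the bagged models (the synthetic data and the fusion rule are illustrative, not the paper's exact configuration):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((500, 40))                 # per-keyframe feature vectors
y = (X[:, 0] > 0.5).astype(int)           # toy concept labels

# Partition the training data into 5 subsets; train one linear SVM each.
models = []
for part in np.array_split(rng.permutation(len(X)), 5):
    models.append(LinearSVC().fit(X[part], y[part]))

def concept_score(x):
    """Late fusion: average the decision values of the bagged LSVMs."""
    return float(np.mean([m.decision_function([x])[0] for m in models]))

score = concept_score(rng.random(40))     # confidence the concept is present
```

In the full system this is repeated per concept and per feature extraction procedure, and the resulting scores are fused again across feature types.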
Finally, the clustering module incorporates an agglomerative hierarchical clustering process [7], which provides a hierarchical view of the keyframes. In addition to the feature vectors used as input to the high-level concept retrieval module, we extract vectors consisting of the responses of the trained concept detectors for each video shot. The clustering algorithm is then applied to these representations in order to group the keyframes into clusters, each consisting of keyframes with visually or semantically similar content.
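Clustering shots by their concept-detector responses can be sketched with standard agglomerative clustering; the random scores below stand in for the 346 detector responses per shot, and the linkage method and cluster count are illustrative choices, not those of [7].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Stand-in for per-shot concept-detector responses (346 concepts in
# VERGE; 10 here to keep the example small).
concept_scores = rng.random((30, 10))

Z = linkage(concept_scores, method="average")    # agglomerative tree
labels = fcluster(Z, t=4, criterion="maxclust")  # cut into <= 4 clusters
```

Cutting the tree at different levels yields the coarse-to-fine view that the interface exposes for browsing.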
VERGE is built on an Apache server with PHP, JavaScript and a MySQL database. Besides the aforementioned basic modules, VERGE integrates the following complementary functionalities: a) basic temporal queries, b) a shot storage structure, and c) a history bin.
3 Interaction Modes
The aforementioned modules help the user interact with the system through a user-friendly interface (Fig. 1) in order to discover the desired video clip during known-item search tasks. The interface comprises three main components: a) the central component, b) the left-side panel, and c) the upper panel.
Fig. 1. Screenshot of the VERGE video retrieval engine
The central component of the interface includes a shot-based representation of the video in a grid-like interface. In this grid each video shot is visualized by a representative keyframe. When the user hovers over an image, a pop-up frame appears that contains a larger preview of the image, allowing better inspection of its content, as well as several links that support:
- browsing the temporally adjacent shots and all shots of the specific video;
- registering an image as relevant for a specific topic or query;
- searching for images visually similar to the given (query) image.
On the left side of the interface, the search history as well as additional search and browsing options are displayed. The history module automatically records all search actions performed by the user, while the search and browsing options include the taxonomy based on the ASR transcriptions, the high-level visual concepts and the hierarchical clustering. Using these functionalities, the user can browse the dataset at shot and scene level, also taking into account the ASR and concept taxonomies.
Finally, the upper panel is a storage structure that mimics the functionality of the
shopping cart found in electronic commerce sites and holds the shots selected by the
user throughout the session.
Acknowledgements This work was partially supported by the European Commission
under contracts FP7-287911 LinkedTV, FP7-318101 MediaMixer and FP7-610411
References
1. Schoeffmann, K., Bailer, W.: Video Browser Showdown. ACM SIGMultimedia Records,
vol. 4, no. 2, pp. 1-2 (2012)
2. Tsamoura, E., Mezaris, V., Kompatsiaris, I.: Gradual transition detection using color co-
herence and other criteria in a video shot meta-segmentation framework. IEEE Internation-
al Conference on Image Processing, Workshop on Multimedia Information Retrieval
(ICIP-MIR 2008), pp. 45-48, San Diego, CA, USA (2008)
3. Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H., Bugalho, M., Trancoso, I.:
Temporal video segmentation to scenes using high-level audiovisual features. IEEE Trans.
on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177 (2011)
4. Rybach, D., Gollan, C., Heigold, G., Hoffmeister, B., Lööf, J., Schlüter, R., Ney, H.: The
RWTH Aachen University Open Source Speech Recognition System. In: Interspeech, pp.
2111-2114, Brighton, UK (2009)
5. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Proceedings of
the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recogni-
tion (CVPR ’06), vol. 2 (2006)
6. Moumtzidou, A., Gkalelis, N., Sidiropoulos, P., Dimopoulos, M., Nikolopoulos, S.,
Vrochidis, S., Mezaris, V., Kompatsiaris, I.: ITI-CERTH participation to TRECVID 2012.
Proc. TRECVID 2012 Workshop, Gaithersburg, MD, USA (2012)
7. Johnson, S.C.: Hierarchical Clustering Schemes. Psychometrika, vol. 32, pp. 241-254 (1967)