VERGE: A Multimodal Interactive Video Search Engine
Anastasia Moumtzidou1, Konstantinos Avgerinakis1, Evlampios Apostolidis1, Fotini
Markatopoulou1,2, Konstantinos Apostolidis1, Theodoros Mironidis1, Stefanos
Vrochidis1, Vasileios Mezaris1, Ioannis Kompatsiaris1, Ioannis Patras2
1Information Technologies Institute/Centre for Research and Technology Hellas,
6th Km. Charilaou - Thermi Road, 57001 Thermi-Thessaloniki, Greece
{moumtzid, koafgeri, apostolid, markatopoulou, kapost, mironidis,
stefanos, bmezaris, ikom}
2School of Electronic Engineering and Computer Science, QMUL, UK
Abstract. This paper presents the VERGE interactive video retrieval engine, which is capable of searching within video content. The system integrates several content-based analysis and retrieval modules, such as video shot boundary detection, concept detection, clustering and visual similarity search.
1 Introduction
This paper describes the VERGE interactive video search engine1, which is capable of retrieving and browsing video collections by integrating multimodal indexing and retrieval modules. VERGE supports the Known-Item Search task, which requires the incorporation of browsing, exploration, or navigation capabilities in a video collection.
Earlier versions of the VERGE search engine were evaluated through participation in video-retrieval-related conferences and showcases such as TRECVID, VideOlympics and the Video Browser Showdown (VBS). Specifically, ITI-CERTH participated in the TRECVID Search task in 2006, 2007, 2008, and 2009, in the Known-Item Search (KIS) task in 2010, 2011, and 2012, in the Instance Search (INS) task in 2010, 2011, 2013 and 2014, and in the VideOlympics event in 2007, 2008 and 2009. VERGE has also participated in VBS 2014. The version of VERGE proposed here aims at participating in the KIS task of the Video Search Showcase (VSS) Competition 2015, which was formerly known as the Video Browser Showdown [1].
2 Video Retrieval System
VERGE is an interactive retrieval system that combines advanced retrieval functionalities with a user-friendly interface, and supports the submission of queries and the accumulation of relevant retrieval results. The following indexing and retrieval modules are integrated in the developed search application: a) Visual Similarity Search Module; b) High Level Concept Detection; and c) Hierarchical Clustering.
1 More information and demos of VERGE are available at:
The aforementioned modules allow the user to search through a collection of images and/or video keyframes. In the case of a video collection, however, the videos must be pre-processed so that they can be indexed in smaller segments, and semantic information must be extracted. The modules applied for segmenting videos are: a) Shot Segmentation; and b) Scene Segmentation.
The general framework realized by VERGE in the case of a video collection is depicted in Figure 1. This framework contains all the aforementioned segmentation and indexing modules.
Fig. 1. Framework of VERGE
2.1 Shot segmentation
The video temporal decomposition module defines the shot segments of the video, i.e., video fragments composed of consecutive frames captured uninterruptedly by a single camera, based on a variation of the algorithm proposed in [2]. The utilized technique represents the visual content of each frame by extracting an HSV histogram and a set of ORB (Oriented FAST and Rotated BRIEF) descriptors (introduced in [3]), and is thus able to detect the differences between a pair of frames both in color distribution and at a more fine-grained structure level. Both abrupt and gradual transitions are then detected by quantifying the change in the content of successive or neighboring frames of the video, and comparing it against experimentally specified thresholds that indicate the existence of abrupt and gradual shot transitions. Erroneously detected abrupt transitions are removed by applying a flash detector, while false alarms among the detected gradual transitions are filtered out with the help of a dissolve detector and a wipe detector that rely on the algorithms introduced in [4] and [5], respectively. Finally, a simple fusion approach (i.e., taking the union of the detected abrupt and gradual transitions) forms the output of the algorithm.
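As a rough illustration of the color-based part of this scheme, the sketch below flags an abrupt transition whenever the HSV-histogram similarity of consecutive frames drops below a threshold. This is only a minimal stand-in: the actual algorithm of [2] additionally uses ORB descriptors, gradual-transition analysis and the flash/dissolve/wipe detectors described above, and the bin counts and threshold here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def hsv_histogram(frame_hsv, bins=(8, 4, 4)):
    """Normalized 3D HSV histogram for one frame (channels assumed in [0, 256))."""
    hist, _ = np.histogramdd(
        frame_hsv.reshape(-1, 3).astype(float),
        bins=bins, range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / max(hist.sum(), 1.0)

def detect_abrupt_transitions(frames_hsv, threshold=0.5):
    """Flag an abrupt shot transition between frames whose histogram
    intersection falls below an experimentally chosen threshold."""
    hists = [hsv_histogram(f) for f in frames_hsv]
    cuts = []
    for i in range(1, len(hists)):
        similarity = np.minimum(hists[i - 1], hists[i]).sum()  # histogram intersection
        if similarity < threshold:
            cuts.append(i)  # cut between frame i-1 and frame i
    return cuts
```

A gradual-transition detector would instead look for frame ranges over which this similarity decays progressively rather than sharply.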
2.2 Scene segmentation
Drawing input from the analysis in section 2.1, the scene segmentation algorithm of
[6] defines the story-telling parts of the video, i.e., temporal segments covering either
a single event or several related events taking place in parallel, by grouping shots into
sets that correspond to individual scenes of the video. For this, content similarity (i.e.,
visual similarity assessed by comparing HSV histograms extracted from the
keyframes of each shot) and temporal consistency among shots are jointly considered
during the grouping of the shots into scenes, with the help of two extensions of the
well-known Scene Transition Graph (STG) algorithm [7]. The first one reduces the computational cost of STG-based shot grouping by considering shot-linking transitivity and the fact that scenes are by definition convex sets of shots, while the second one builds on the former to construct a probabilistic framework that alleviates the need for manual STG parameter selection. Based on these extensions, and as reported in [6], the employed technique can identify the scene-level structure of videos belonging to different genres and provide results that match human expectations well, while the time needed for processing is a very small fraction (<3%) of the video's duration.
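The grouping idea can be sketched as follows: link shots whose keyframe histograms are similar within a temporal window, then exploit the convexity of scenes by merging every shot that lies between two linked shots into the same scene. This is a greatly simplified stand-in for the STG extensions of [6]; the similarity threshold and window size are illustrative assumptions.

```python
import numpy as np

def group_shots_into_scenes(shot_hists, sim_threshold=0.7, window=3):
    """Greedy stand-in for STG-based grouping: link shot i to a later shot j
    (within a temporal window) when their keyframe histograms are similar,
    then merge everything between linked shots into one scene, since scenes
    are convex sets of shots."""
    n = len(shot_hists)
    scene_end = list(range(n))  # furthest shot each shot is linked to
    for i in range(n):
        for j in range(i + 1, min(i + 1 + window, n)):
            sim = np.minimum(shot_hists[i], shot_hists[j]).sum()
            if sim >= sim_threshold:
                scene_end[i] = max(scene_end[i], j)
    scenes, start = [], 0
    while start < n:
        end = scene_end[start]
        k = start
        while k <= end:            # extend the scene while links reach past it
            end = max(end, scene_end[k])
            k += 1
        scenes.append((start, end))  # scene = shots start..end inclusive
        start = end + 1
    return scenes
```

The probabilistic STG extension of [6] removes the need to hand-pick `sim_threshold` and `window`, which this toy version still requires.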
2.3 Visual Similarity Search
The visual similarity search module performs content-based retrieval based on global and local information. To handle global information, MPEG-7 descriptors are extracted from each keyframe and concatenated into a single feature vector. More specifically, the colour-related descriptors Colour Structure (CS), Colour Layout (CL) and Scalable Colour (SC) are used. For local information, SURF features are extracted; K-Means clustering is then applied on the database vectors to build the visual vocabulary, and VLAD encoding is used to represent the images [8].
For Nearest Neighbour search we implement the three different approaches for comparing query and database vectors that are described in [8]. In each case, an index is first constructed for the database vectors and the K-Nearest Neighbours of the query are then computed. As in [8], an index of lower-dimensional PCA-projected VLAD vectors, an ADC index and an IVFADC index were constructed from the database vectors. Exhaustive search is deployed in the first two cases, using Symmetric Distance Computation (SDC) and Asymmetric Distance Computation (ADC) for Nearest Neighbour calculation, while a faster solution is offered by the third, where an inverted file system is combined with ADC. Based on the experiments reported in [9], IVFADC performs best; our implementation therefore applies this approach, although we will also investigate the other two methods. It should be noted that this indexing structure is utilized for both descriptor types (i.e., global and local). Finally, a web service is implemented in order to accelerate the querying process.
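The core of these indexes is product quantization with asymmetric distances. The toy sketch below trains per-subspace codebooks with a tiny k-means, compresses the database vectors into centroid indices, and answers a query via ADC lookup tables; it omits the inverted file (coarse quantizer) that IVFADC [9] adds on top, and all sizes (`m`, `k`, iteration count, the deterministic initialization) are illustrative assumptions rather than the settings of [8] or [9].

```python
import numpy as np

def train_pq(X, m=4, k=8, iters=10):
    """Train product-quantizer codebooks: split the d-dim vectors into m
    subvectors and run a tiny k-means (k centroids) independently per subspace."""
    d = X.shape[1]
    ds = d // m
    init_idx = np.linspace(0, len(X) - 1, k).astype(int)  # deterministic init
    codebooks = []
    for s in range(m):
        sub = X[:, s * ds:(s + 1) * ds]
        cent = sub[init_idx].copy()
        for _ in range(iters):
            assign = np.argmin(((sub[:, None] - cent[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                if np.any(assign == c):
                    cent[c] = sub[assign == c].mean(axis=0)
        codebooks.append(cent)
    return codebooks

def pq_encode(X, codebooks):
    """Compress each vector to m centroid indices (one per subspace)."""
    m = len(codebooks)
    ds = X.shape[1] // m
    codes = np.empty((len(X), m), dtype=np.int64)
    for s, cent in enumerate(codebooks):
        sub = X[:, s * ds:(s + 1) * ds]
        codes[:, s] = np.argmin(((sub[:, None] - cent[None]) ** 2).sum(-1), axis=1)
    return codes

def adc_search(query, codes, codebooks, topk=5):
    """Asymmetric Distance Computation: the uncompressed query is compared
    against compressed database codes via per-subspace distance lookup tables."""
    m = len(codebooks)
    ds = len(query) // m
    tables = [((codebooks[s] - query[s * ds:(s + 1) * ds]) ** 2).sum(-1)
              for s in range(m)]
    dists = np.zeros(len(codes))
    for s in range(m):
        dists += tables[s][codes[:, s]]
    return np.argsort(dists)[:topk]
```

SDC would instead quantize the query as well and compare codes against codes, trading a little accuracy for precomputable centroid-to-centroid tables.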
2.4 High Level Concepts Retrieval Module
This module indexes the video shots based on 346 high-level concepts (e.g., water, aircraft). To build concept detectors, a two-layer concept detection system is employed. The first layer builds multiple independent concept detectors. The video stream is initially sampled, generating for instance one keyframe per shot via shot segmentation. Subsequently, each sample is represented using one or more types of appropriate local descriptors (e.g., SIFT, RGB-SIFT, SURF, ORB). The descriptors are extracted from more than one square region at different scale levels. All local descriptors are compacted using PCA and subsequently aggregated using VLAD encoding. These VLAD vectors are compressed by applying a modification of the random projection matrix [10] and serve as input to Logistic Regression (LR) classifiers. Following the bagging methodology of [11], five LR classifiers are trained per concept and per local descriptor, and their output is combined by means of late fusion (averaging). When different descriptors are combined, late fusion is again performed by averaging the classifier output scores. In the second layer of the stacking architecture, the fused scores from the first layer are aggregated into model vectors and refined by two different approaches. The first approach uses a multi-label learning algorithm that incorporates concept correlations [12]. The second is a temporal re-ranking method that re-evaluates the detection scores based on video segments, as proposed in [13].
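The first-layer training and fusion steps can be sketched with scikit-learn as below: several LR classifiers are fit on bootstrap samples of the (already encoded) training vectors, their positive-class probabilities are averaged within a descriptor type, and the per-descriptor scores are averaged again across types. This is a minimal sketch of the bagging-plus-late-fusion scheme, assuming generic feature vectors; the descriptor extraction, VLAD encoding and random-projection compression described above are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_bagged_detector(X, y, n_models=5, seed=0):
    """Train n_models Logistic Regression concept detectors on bootstrap
    samples of the training vectors (five per concept and descriptor,
    following the bagging methodology the paper cites [11])."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), len(X), replace=True)  # bootstrap sample
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def detector_score(models, X):
    """Late fusion within one descriptor type: average the positive-class
    probabilities of the bagged classifiers."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

def fuse_descriptors(per_descriptor_scores):
    """Late fusion across descriptor types (e.g. SIFT, RGB-SIFT, SURF):
    again a plain average of the per-descriptor score vectors."""
    return np.mean(per_descriptor_scores, axis=0)
```

The second-layer refinement would then stack these fused scores into model vectors before applying the multi-label [12] or temporal re-ranking [13] step.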
2.5 Hierarchical Clustering
This module incorporates a generalized agglomerative hierarchical clustering process [14], which provides a structured hierarchical view of the video keyframes. In addition to the feature vectors described in section 2.3, we extract vectors consisting of the responses of the concept detectors for each video shot. Hierarchical clustering is applied to these representations to group the keyframes into classes, each of which consists of keyframes with similar content, in line with the concepts provided.
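A minimal sketch of this step, using SciPy's agglomerative clustering on generic keyframe vectors (either the visual features of section 2.3 or concept-response vectors); the average linkage and Euclidean metric are illustrative assumptions, not necessarily those of [14]:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_keyframes(vectors, n_clusters):
    """Agglomerative clustering of keyframe representations; returns one
    cluster label per keyframe. Cutting the dendrogram Z at different depths
    yields the hierarchical view described above."""
    Z = linkage(vectors, method="average", metric="euclidean")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Re-cutting the same linkage matrix with a larger `t` refines the clusters without re-running the clustering, which suits interactive drill-down browsing.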
3 VERGE Interface and Interaction Modes
The modules described in section 2 are incorporated into a user-friendly interface (Figure 2) that helps the user interact with the system and discover and retrieve the desired video clip. A friendly and smartly designed graphical user interface (GUI) plays a vital role in this procedure, and within this context the GUI of VERGE has been redesigned to improve the user experience. Like its predecessor, the new interface comprises three main components: a) the central component; b) the left side; and c) the lower panel. The aforementioned modules are incorporated inside these components to allow the user to interact with the system and retrieve the desired video clip during known-item search tasks. In the sequel, we briefly describe the three main components of the VERGE system and then present a simple usage scenario.
Fig. 2. Screenshot of VERGE video retrieval engine
The central component of the interface presents a shot-based or scene-based representation of the video in a grid-like layout. When the user hovers over a shot keyframe, a preview of the shot is shown by cycling through three to five different keyframes that constitute the shot. Moreover, when the user clicks on a shot, a pop-up frame appears that contains a larger preview of the image along with several links that allow viewing adjacent shots or all shots of the video, viewing the frames constituting the shot, and searching for visually similar images. On the left side of the interface, the search history is displayed, together with additional search and browsing options that include the high-level visual concepts and the hierarchical clustering. Finally, the lower panel is a storage structure that holds the shots selected by the user.
Regarding the usage scenario for the known-item task, suppose that a user is interested in finding a clip in which 'a man hugs a woman while both of them have dark skin' (Figure 2). Given that there is a high-level concept called "dark-skinned people", the user can initiate her search from it. Then, if a relevant image is retrieved in this first step, she can use the visual similarity module; if she finds an image that possibly matches the query, she can browse the temporally adjacent shots and retrieve the desired clip. Finally, the user can store the desired shots in the basket.
4 Future Work
Future work includes the fusion of high-level visual concepts, in order to allow retrieval of video shots that can be described equally well by more than one concept. Another feature that could be implemented is the capability of querying the video collection with one or more colors found in a specific place of the shot; this, however, requires knowledge of the specific location of the color in the image.
Acknowledgements This work was supported by the European Commission under
contracts FP7-287911 LinkedTV, FP7-600826 ForgetIT, FP7-610411
References
1. Schoeffmann, K., Bailer, W.: Video Browser Showdown. ACM SIGMultimedia Records, vol. 4, no. 2, pp. 1-2 (2012)
2. Apostolidis, E., Mezaris, V.: Fast shot segmentation combining global and local visual de-
scriptors. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 6583-6587 (2014)
3. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT
or SURF. 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2564-
2571 (2011)
4. Su, C.-W., Liao, H.-Y.M., Tyan, H.-R., Fan, K.-C., Chen, L.-H.: A motion-tolerant dis-
solve detection algorithm. IEEE Transactions on Multimedia, vol. 7, pp.1106-1113 (2005)
5. Seo, K.-D., Park, S., Jung, S.-H.: Wipe scene-change detector based on visual rhythm
spectrum. IEEE Transactions on Consumer Electronics, vol. 55, no. 2, pp. 831-838 (2009)
6. Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H., Bugalho, M., Trancoso, I.:
Temporal video segmentation to scenes using high-level audiovisual features. IEEE Trans-
actions on Circuits and Systems for Video Technology, vol. 21(8), pp. 1163-1177 (2011)
7. Yeung, M., Yeo, B.-L., Liu, B.: Segmentation of video by clustering and graph analysis.
Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94-109 (1998)
8. Jegou, H., Douze, M., Schmid, C., Perez, P.: Aggregating local descriptors into a compact image representation. In Proc. CVPR (2010)
9. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 117-128 (2011)
10. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Applications to image and text data. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 245-250 (2001)
11. Markatopoulou, F., Moumtzidou, A., Tzelepis, C., Avgerinakis, K., Gkalelis, N., Vrochid-
is, S., Mezaris, V., Kompatsiaris, I.: ITI-CERTH participation to TRECVID 2013. In
TRECVID 2013 Workshop, Gaithersburg, MD, USA (2013)
12. Markatopoulou, F., Mezaris, V., Kompatsiaris, I.: A comparative study on the use of multi-
label classification techniques for concept-based video indexing and annotation. In: C.
Gurrin, F. Hopfgartner, W. Hurst, H. Johansen, H. Lee, and N. OConnor (eds.), MultiMe-
dia Modeling, vol. 8325, pp. 1–12 (2014)
13. Safadi B., Quénot, G.: Re-ranking by local re-scoring for video indexing and retrieval.
20th ACM Int. Conf. on Information and Knowledge Management, pp. 2081–2084 (2011)
14. Johnson, S.C.: Hierarchical Clustering Schemes. Psychometrika, vol. 32, pp. 241-254 (1967)