VERGE: A Video Interactive Retrieval Engine
Stefanos Vrochidis, Anastasia Moumtzidou, Paul King, Anastasios Dimou, Vasileios Mezaris
and Ioannis Kompatsiaris
Informatics and Telematics Institute
6th Km Charilaou-Thermi Road, Thessaloniki, Greece
{stefanos, moumtzid, king, dimou, bmezaris, ikom}
Abstract

This paper presents the video retrieval engine VERGE, which combines indexing, analysis and retrieval techniques in various modalities (i.e. textual, visual and concept search). The functionalities of the search engine are demonstrated through the supported user interaction modes.
1 Introduction
Advances in multimedia technologies, combined with the decreasing cost of storage devices and the high penetration of the World Wide Web, have led to huge and rapidly growing archives of video content in recent years. This growth creates the need for advanced video search engines that extend beyond traditional text retrieval and exploit modern image and video analysis techniques, such as content-based indexing and high-level concept extraction.
This paper demonstrates the interactive retrieval engine VERGE, built by ITI-CERTH (the Informatics and Telematics Institute of the Centre for Research & Technology Hellas), which supports multimodal retrieval functionalities including text, visual, and concept search. Earlier versions of VERGE have been used in a number of video evaluation workshops (TRECVID 2006, 2007, 2008 and 2009) [1], [2], as well as in related events, such as VideOlympics 2007, 2008 and 2009 [3]. In the rest of the paper, Section 2 discusses the problem addressed by VERGE, Section 3 presents a detailed description of the system, Section 4 demonstrates the search engine through use cases, and Section 5 concludes the paper.
2 Problem Addressed
Due to the growth of accessible video content discussed above, there is increasing user demand for searching video collections in order to spot specific incidents or events. For instance, a user could be interested in finding a scene in a movie in which two actors are arguing, or in viewing the part of a documentary where a famous politician is speaking to a crowd. If no search capabilities are available, the user has to browse the entire video to find the desired scene, a very time-consuming and difficult task, especially when the video is long and the target scene is short.
To support the user in such a use case, videos should be pre-processed so that they can be indexed as smaller segments, and semantic information should be extracted. The proposed video search engine is built upon a framework that employs modern image and video analysis technologies to support the user in such search tasks. The analysis targets general-case video data (i.e. documentaries, sports, educational videos, etc.), so it can be applied in almost any domain.
3 System Description
VERGE is an interactive video retrieval system, which
realizes the framework of Figure 1. It combines basic re-
trieval functionalities in various modalities (i.e., visual, tex-
tual), accessible through a friendly Graphical User Inter-
face (GUI) (Figure 2). The system supports the submission
of hybrid queries that combine the available retrieval func-
tionalities, as well as the accumulation of relevant retrieval
results. The following basic indexing and retrieval modules
are integrated in the developed search application:
Visual Similarity Indexing Module;
Text Processing and Recommendation Module;
High Level Concept Extraction Module;
Visual and Textual Concepts Fusion Module.
Besides the basic retrieval modules, the system inte-
grates a set of complementary functionalities, including
Proc. 8th International Workshop on Content-Based Multimedia Indexing (CBMI 2010), Grenoble, France, June 2010, pp. 142-147.
Figure 1. Architecture of the video search engine (video source, shot segmentation, textual information processing, high level concept extraction module, visual similarity indexing module, visual and textual concepts fusion module, and the graphical user interface with query processing).
Figure 2. User interface of the VERGE search engine (suggested keywords, MPEG-7 color and visual search, bag-of-words SIFT search, visual and text concept search, hybrid text and concept search, color filter, main results area, video shots and side shots, stored results).
temporal queries, color filtering options, fast video shot pre-
view, as well as a basket storage structure. More specif-
ically, the system supports basic temporal queries such as
the presentation of temporally adjacent shots of a specific
video shot and the shot-segmented view of each video. In
addition, the system supports a fast shot preview by rolling
three different keyframes. In that way, the user obtains ad-
equate information of the video shot in a very short time.
Furthermore, the system offers a color filter option that fil-
ters the presented results to either grayscale or color images.
Finally, the selected shots, considered to be relevant to the
query, can be stored by the user by employing a storage
structure that mimics the functionality of the shopping cart
found in electronic commerce sites. In the next sections we
will describe the VERGE interface, the aforementioned ba-
sic retrieval modules and the system’s specifications.
3.1 Interface Description
The Graphical User Interface (GUI) of VERGE comprises two main parts: the surrounding layout and the results container. The surrounding layout of the
interface contains the necessary options for formulating a
query, presented in the left column, and the storage struc-
ture (basket) on the top of the page, where the user can store
the relevant images. The text search resides in the left col-
umn of the GUI, where the user can insert a keyword or a
phrase to search. A sliding bar residing just below this form
is used for combining the results of the textual and visual
concepts by assigning weights to each technique. Below
this slider, the related terms of a text query are presented,
which are produced dynamically according to the inserted
textual information and aim at helping the user by suggest-
ing broader, narrower and related terms. Underneath, there
is a color filter option that limits the retrieved results to either colored or grayscale shots. Finally, at the bottom, the high level visual concepts are listed as a hierarchy for easier browsing.
The results container, located in the main part of the interface, is the part of the GUI in which the results of a query are presented. Specific information and
links are provided underneath each presented shot offering
the following options to the user in the form of clickable
buttons: a) to view all the shots of the specific video, b) to
view the 12 temporally adjacent shots, c) to search for vi-
sually similar shots based on MPEG-7 Color descriptors, d)
to search for visually similar shots exploiting SIFT descrip-
tors, and e) to submit a shot to the basket. In addition, the
user can watch a short preview of any shot by hovering the mouse over it, and can fire a visual search based on MPEG-7 Color and Texture descriptors by clicking on the keyframe itself.
3.2 Visual Similarity Indexing Module
The visual search module exploits the visual content of
images in the process of retrieving visually similar results.
Given that the input considered is video, these images are
obtained by representing each shot with its temporally mid-
dle frame, called the representative keyframe. Visual simi-
larity search is realized by extracting either global informa-
tion, such as the MPEG-7 visual descriptors that capture dif-
ferent aspects of human perception (i.e. color and texture)
from representative keyframes of video shots, or local information such as interest points described using SIFT [4]. In both cases, the information is extracted from the representative keyframe of each shot.
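The global feature vector formed from multiple descriptors can be sketched as follows. This is a simplified stand-in, not the paper's exact scheme: the real MPEG-7 descriptors (e.g. color and texture) come from a dedicated extractor, and the per-part L2 normalisation shown here is an assumption made so that neither modality dominates the concatenated space.

```python
import numpy as np

def concat_descriptors(color_desc, texture_desc):
    """Concatenate per-keyframe descriptors into one feature vector.

    Each part is L2-normalised before concatenation (an illustrative
    choice; the paper does not specify its normalisation scheme)."""
    parts = []
    for d in (color_desc, texture_desc):
        d = np.asarray(d, dtype=float)
        n = np.linalg.norm(d)
        parts.append(d / n if n > 0 else d)
    return np.concatenate(parts)

# Hypothetical descriptor sizes, for illustration only.
color = np.random.rand(32)    # stands in for a color-based MPEG-7 descriptor
texture = np.random.rand(80)  # stands in for a texture-based MPEG-7 descriptor
vec = concat_descriptors(color, texture)
```

With two unit-normalised parts, the concatenated vector always has norm sqrt(2), so both modalities contribute equally to Euclidean distances in the joint space.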
In the first case, we formulate a feature vector by con-
catenating different MPEG-7 descriptors to compactly rep-
resent each image in a multidimensional space. Based on an
empirical evaluation of the system's performance, two different schemes were selected: the first relied on color and texture, while the second relied solely on color.
In the second case, the method employed was an imple-
mentation of the bag-of-visual-words approach as described
in [5]. A large amount of local descriptors extracted from
different frames was clustered to a fixed number of clusters
using a k-means algorithm, and the resulting cluster cen-
ters were selected as visual words. Each image was sub-
sequently described by a single feature vector that corre-
sponded to a histogram generated by assigning each keypoint of the image to its closest visual word. The number of selected clusters (100 in our case) determined the number of bins and, consequently, the dimensionality of the resulting feature vectors.
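The histogram-assignment step of the bag-of-visual-words approach can be sketched as below. The vocabulary (cluster centers) is assumed to have been learned beforehand with k-means; here random data stands in for both the SIFT descriptors and the centers, purely for illustration.

```python
import numpy as np

def assign_histogram(descriptors, centers):
    """Map each local descriptor to its nearest visual word and return
    the bag-of-words histogram (one bin per visual word)."""
    # Squared Euclidean distance from every descriptor to every center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # nearest visual word per keypoint
    return np.bincount(words, minlength=len(centers))

rng = np.random.default_rng(0)
centers = rng.random((100, 128))      # 100 visual words, SIFT-sized (128-d)
descriptors = rng.random((250, 128))  # 250 local descriptors from one keyframe
hist = assign_histogram(descriptors, centers)
```

The histogram has exactly one bin per visual word (100 here) and sums to the number of keypoints, matching the fixed dimensionality described above.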
To enable faster retrieval, a multi-dimensional R-tree [6] indexing structure was constructed off-line using the feature vectors of all the shots. During the query phase, the feature vector of the query image is fed to the index structure, which outputs a set of keyframes that are found to resemble the query. Since these keyframes are not ranked according to their level of similarity to the query example, an additional ranking step, using custom distance metrics between their feature vectors, is applied to yield the final retrieval outcome.
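The re-ranking step that follows the index lookup can be sketched as below. The paper's "custom distance metrics" are not specified, so plain Euclidean distance is used here as an assumption; the candidate set stands in for the unranked keyframes returned by the R-tree.

```python
import numpy as np

def rerank(query_vec, candidate_vecs, candidate_ids, top_k=5):
    """Re-rank index candidates by exact distance to the query vector
    and return the top_k shot identifiers, nearest first."""
    q = np.asarray(query_vec, dtype=float)
    dists = np.linalg.norm(np.asarray(candidate_vecs, dtype=float) - q, axis=1)
    order = np.argsort(dists)[:top_k]
    return [candidate_ids[i] for i in order]

rng = np.random.default_rng(1)
query = rng.random(16)
cands = rng.random((20, 16))                 # unranked candidates from the index
ids = [f"shot_{i}" for i in range(20)]       # hypothetical shot identifiers
ranked = rerank(query, cands, ids, top_k=3)
```

This two-phase design (approximate index lookup, then exact re-ranking of a small candidate set) keeps the expensive distance computation off the full collection.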
3.3 Text Processing and Recommendation Module
The text search module exploits any textual informa-
tion, which is associated with the video. This implementa-
tion utilizes text, which was extracted by Automatic Speech
Recognition (ASR) from the embedded audio. The result-
ing annotations were used to create a full-text index utiliz-
ing Lemur [7], a toolkit designed to facilitate research in
language modelling. The standard pipeline of text processing techniques was used in creating the index, including stopword removal and application of the Porter stemming algorithm [8].
To assist the user in query iteration tasks, a hierarchical
navigation menu of suggested keywords was generated at
runtime from each query submission. Conventional term re-
lationships were supplied by a local WordNet database [11],
whereby hypernyms were mapped to broader terms and hy-
ponyms to narrower terms. Related terms were provided by
synset terms that were not used for automatic query expan-
sion [2].
All standard post-coordination techniques were imple-
mented for query formulation and editing tasks, includ-
ing logical Boolean operators (e.g., OR, AND, and NOT),
term grouping (using parentheses), and phrases (delimited
by quotes), thus allowing the user to manipulate expanded
query terms to improve precision.
Recall was boosted by automatic query expansion, also
based on WordNet, whereby a list of expanded terms were
generated from WordNet synsets. The system was con-
strained to using the first term in the query string due to
the observation that short queries often produce more ef-
fective expansions [9]. Terms were chosen for expansion
by measuring the semantic similarity between each synset
term and the original (initial) query term by utilizing the
extended gloss overlap method [10].
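The term-selection idea can be illustrated with a toy version of gloss overlap. The glosses below are made-up stand-ins for WordNet synset glosses, and the bare word-overlap count is a much cruder measure than the extended gloss overlap method of [10]; the sketch only shows the shape of the computation.

```python
def gloss_overlap(gloss_a, gloss_b):
    """Score two glosses by the number of shared (lower-cased) words."""
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

# Hypothetical mini-thesaurus of glosses, for illustration only.
glosses = {
    "car": "a motor vehicle with four wheels",
    "truck": "a motor vehicle designed to carry freight",
    "banana": "an elongated curved tropical fruit",
}

def expansion_terms(query_term, threshold=2):
    """Keep candidate terms whose gloss overlaps the query gloss enough."""
    q = glosses[query_term]
    return [t for t, g in glosses.items()
            if t != query_term and gloss_overlap(q, g) >= threshold]

terms = expansion_terms("car")  # "truck" shares "motor vehicle" with "car"
```

In this toy vocabulary, "truck" survives the threshold while "banana" is filtered out, mirroring how semantically close terms are kept for expansion and unrelated ones discarded.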
Although the performance of the module is satisfactory in terms of time-efficiency and navigation, the quality of the results greatly depends on the reliability of the speech transcripts.
3.4 High Level Concept Extraction Module
This search modality offers selection of high level visual
concepts (e.g. water, aircraft, landscape, crowd, etc.). The
extraction of high level concept information was based on
the approach described in [2]. A set of MPEG-7-based fea-
tures were concatenated to form a single MPEG-7 feature
vector for every shot, while a Bag-of-Words feature based
on SIFT descriptors of local interest points was also calcu-
lated for every shot, as described in section 3.2.
A set of SVM classifiers (LIBSVM) [12] was trained for
each high level visual concept and each possible shot repre-
sentation (MPEG-7, SIFT) separately. Subsequently, a second set of SVM classifiers, one per high level concept, was trained to fuse the previous classification results and produce a final score in the range [0, 1] associating each shot with each considered high level concept. In the present
version of the VERGE engine, 42 of these concepts (e.g.
building, car, mountain, snow, etc.) are considered. These
concepts are depicted in the GUI in the form of a hierarchy
(Figure 2).
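The two-stage scoring can be sketched as follows. This is a simplified stand-in: VERGE trains SVMs (LIBSVM) at both stages, whereas here the second-stage combiner is a fixed logistic over a weighted sum of the per-representation scores; the weights, the shot identifiers, and the scores are all illustrative assumptions.

```python
import math

def fuse_scores(mpeg7_score, sift_score, w=(0.6, 0.4)):
    """Combine the per-representation confidences into one score in
    (0, 1). A trained second-stage SVM plays this role in VERGE; the
    fixed logistic here only illustrates the fusion step."""
    z = w[0] * mpeg7_score + w[1] * sift_score
    return 1.0 / (1.0 + math.exp(-4.0 * (z - 0.5)))  # squash to (0, 1)

# First-stage outputs per shot: (MPEG-7-based score, SIFT-based score).
shots = {"shot_1": (0.9, 0.8), "shot_2": (0.2, 0.1)}
final = {s: fuse_scores(m, v) for s, (m, v) in shots.items()}
```

Shots that both representations agree are relevant ("shot_1") end up near 1, while shots that both reject ("shot_2") end up near 0, which is the behaviour the second classification stage is trained to produce.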
3.5 Visual and Textual Concepts Fusion Module

This module combines high level visual concepts with textual information by applying a manually assisted linear fusion. From a usability point of view, the user provides a keyword and specifies, with the aid of a slider, the relative significance of the textual versus the visual results.
The procedure is reflected by Equation 1, where i is a specific shot, Sim_i is the final similarity score after the fusion, VScore_i is the normalized degree of confidence for a given visual concept, TScore_i is the normalized similarity score of the textual module and, finally, α and β are the weights assigned to their original values, respectively.

Sim_i = α · VScore_i + β · TScore_i, where α + β = 1    (1)
It should be noted that the similarity scores are normalized to a range between 0 and 1, where a higher value indicates greater relevance to the query. Obviously, if one of the weights is set to zero (and hence the other to one), the results obtained are either exclusively textual or exclusively visual.
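Equation 1 translates directly into code. The sketch below assumes, as the paper states, that both input scores are already normalized to [0, 1]; alpha corresponds to the slider position, with beta implied as 1 - alpha.

```python
def fuse(v_score, t_score, alpha):
    """Linear fusion of Equation 1: Sim = a*VScore + b*TScore, b = 1 - a.

    v_score and t_score are assumed to be normalized to [0, 1];
    alpha is the slider position (1.0 = purely visual results)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * v_score + (1.0 - alpha) * t_score

sim = fuse(v_score=0.8, t_score=0.4, alpha=0.5)  # balanced slider, sim ≈ 0.6
```

Setting alpha to 0.0 or 1.0 reproduces the purely textual or purely visual rankings, matching the slider behaviour described above.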
3.6 Implementation Issues
The search system, combining the aforementioned modules, was built on open-source web technologies, more specifically the Apache server, PHP, JavaScript, a MySQL database, and Strawberry Perl. It also requires the Indri Search Engine that is part of the Lemur Toolkit [7] and a local installation of WordNet.
Figure 3. Textual search for keyword ‘vehicle’.
4 Interaction Modes
In this section, we present two use cases of the VERGE
system to demonstrate its functionalities.
In the first usage scenario we suppose that a user is inter-
ested in finding ‘street scenes with moving vehicles’. At
first, the user inserts the keyword ‘vehicle’ into the text
query box and moves the slider to the left, which leads to
placing emphasis on the textual information generated from
ASR (Figure 3). Although the results retrieved are quite
satisfactory, it is expected that by increasing the weight as-
signed to the high level visual concept module, the results
will be enhanced. Therefore, the user moves the textual-
versus-visual information slider to the right, thus assigning
more weight on the results obtained by processing the visual
information. Figure 4 depicts the output obtained, demon-
strating that the high level visual concept module improves
the results. In addition, the user can view the sideshots (i.e.
the temporally adjacent shots) and the associated textual in-
formation of a specific video shot. Figure 5 illustrates the
interface when sideshots are visualized. From the results
it can be observed that some of the sideshots depict vehi-
cles, which is very reasonable as temporally adjacent shots
are usually semantically related. Finally, the user selects
a shot from the retrieved results and searches for visually
similar ones, based on the shot’s MPEG-7 color and texture
features. By executing this query, more relevant results are
retrieved, as illustrated in Figure 6.
Figure 4. High level visual search for the concept 'vehicle'.
In the second use case, the user is searching for video
shots depicting ‘scenes with crowd’. At first, the user in-
serts the keyword ‘crowd’ into the text query box and moves
the slider to the middle. The shots retrieved (Figure 7) re-
sult from the linear fusion of the results of the textual and
the high level visual concept modules. To retrieve more re-
lated shots, the user selects one of the retrieved keyframes
and searches for visually similar shots using MPEG-7 color
features. The retrieved results are illustrated in Figure 8.
Finally, the user can search using local features (SIFT) by
clicking on the ‘BoW’ button (Figure 9) of a specific video shot.
Figure 5. Temporally adjacent shots.
Figure 6. Visual search using MPEG-7.
Figure 7. Linear fusion of textual and high level visual concept module results for the query 'crowd'.
Mpeg -7 Color-based Visual Search for Selected Query Image
Figure 8. Visual search using color-based
MPEG-7 features.
Figure 9. Visual search using SIFT.
5 Summary and Future work
This paper presented a description of the interactive
video search engine VERGE. Its performance was demon-
strated through different user interaction modes in order to
reveal its potential. Currently VERGE is employed in re-
search experiments dealing with the exploitation of user im-
plicit feedback, expressed by user navigation patterns (i.e.
mouse clicks and keystrokes), as well as by sensor outputs
(e.g. eye tracker) during interactive retrieval tasks, in order
to optimize and complement existing content based retrieval
functionalities and provide recommendations. Future activ-
ities include the adaptation of VERGE modules to support
domain specific search with a view to retrieving web pages
offering environmental services.
The website of the VERGE search engine provides up-to-date information about the latest implementations, video tutorials and demonstrations, as well as links to the different online versions of VERGE.
6 Acknowledgements
This work was supported by the European projects PESCaDO (FP7-248594), CHORUS+ (FP7-249008) and GLOCAL (FP7-248984), funded by the European Commission.

References

[1] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pp. 321–330, New York, NY, USA, 2006.
[2] A. Moumtzidou, A. Dimou, P. King, S. Vrochidis, A. An-
geletou, V. Mezaris, S. Nikolopoulos, I. Kompatsiaris, L.
Makris. ITI-CERTH participation to TRECVID 2009 HLFE
and Search, 7th TRECVID Workshop, Gaithersburg, USA,
November 2009, 2009.
[3] C.G.M. Snoek, M. Worring, O. de Rooij, K.E.A. van de
Sande, V. Rong, A.G. Hauptmann. VideOlympics: Real-
Time Evaluation of Multimedia Retrieval Systems, Multi-
media, IEEE, 15(1):86–91, 2008.
[4] D. G. Lowe. Distinctive image features from scale-invariant
keypoints, International Journal of Computer Vision,
60(2):91–110, 2004.
[5] J. Sivic and A. Zisserman. Video google: a text retrieval
approach to object matching in videos, in Proceedings of
the International Conference on Computer Vision, pp 1470–
1477, 2003.
[6] A. Guttman. R-trees: a dynamic index structure for spatial
searching. In SIGMOD ’84: Proceedings of the 1984 ACM
SIGMOD international conference on Management of data,
pages 47–57, New York, NY, USA, 1984. ACM.
[7] The Lemur Toolkit.
[8] M. F. Porter. An algorithm for suffix stripping, Program,
14(3):130–137, 1980.
[9] H. Fang and C. X. Zhai. Semantic term matching in ax-
iomatic approaches to information retrieval. In SIGIR ’06:
Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information re-
trieval, pages 115–122, New York, NY, USA, 2006. ACM.
[10] S. Banerjee and T. Pedersen. Extended gloss overlaps as
a measure of semantic relatedness. In Proceedings of the
Eighteenth International Joint Conference on Artificial In-
telligence, pages 805–810, 2003.
[11] C. Fellbaum, editor. WordNet: An Electronic Lexical Data-
base (Language, Speech, and Communication), The MIT
Press, May 1998.
[12] C. C. Chang and C. J. Lin. LIBSVM: a library for support
vector machines, 2001.
Proc. 8th International Workshop on Content-Based Multimedia Indexing (CBMI 2010), Grenoble, France, June 2010, pp. 142-147.
... This aggregation consists of merging these different interpretations in a single container. In current systems [4], [21], [5], [22], [8], and [23], fusion is used not only for aggregation, but also for improving the quality of the semantic interpretations. ...
... In [23], Verge is an interactive video retrieval system. Its indexing process uses a fusion system that combines visual and textual concepts. ...
... In other situations, we find that the concept relevance analysis in a video content varies from one modality to another. This conflict over the relevance degree for the same concept is solved using the equation 3. (Same equation used in [23]). ...
Conference Paper
Full-text available
In this paper, we propose a semantic indexing system for reducing the semantic gap between the machine and human interpretations on a video document by generating a finer indexing quality. To do so, data fusion of analyzed interpretation (concepts) from different video modalities (sound and images) is viewed as the required choice. Considering problems that can emanate from employing data fusion, our approach is based on some intelligent techniques aiming to improve semantic indexing quality of the REGIMVID framework. At first, we use fuzzy logic to deal with uncertain and conflicting situations. An ontology is exploited for laying out relationships between different concepts. Finally, a deductive and inductive inference engines are used to discover respectively further concepts and further relationships between concepts. We developed a fusion system that responds, in its first version, to such point treated in our approach. Our participation in the semantic indexing task of the TRECVID2010 competition has led to promising results. Index Terms—Data Fusion Model, Multimodal Fuzzy Fusion, Semantic Video Indexing.
... We can recognize that level 1 has a great importance since it eliminates any irrelevant information. However, several actual indexing systems do not account well for this component (as in [Vrochidis et al. 2010], [C. G. M. Snoek et al. 2006] and [Ayache et al. 2007]). ...
... is solved using the equation 4.3. (Same equation used in[Vrochidis et al. 2010]). ...
Full-text available
Our thesis work deals with the video indexing based on semantic interpretation (an abstrac- tion of objects or events that figure in a content), more particularly, the semantic indexing enhancement. Various approaches for semantic multimedia content analysis have been pro- posed addressing the discovery of features ranging from low-level features (color, histograms, sound frequency, motions, . . . ) to high-level ones (semantic objects and concepts). However, these earlier approaches failed to reduce the semantic gap and were not able to deliver an accurate semantic interpretation. Under such a context, exploring further semantics within a multimedia content to improve semantic interpretation capabilities, is a major and a pre- requisite challenge. Towards exploring further semantic information within a multimedia content (other than low-level and semantic concepts one), valuable information (mainly concepts interrelation- ships and contexts) could be gathered from a multimedia content in order to enhance semantic interpretation capabilities. Motivated by a kindred vision of human perception, yet targeting automated analysis of a multimedia content, the multimedia retrieval community addressed more attention to multimedia ontologies. Aiming to contribute towards this direction, we focus on modeling an automated fuzzy context-based ontology framework for enhancing a video indexing accuracy and efficiency. Key dimensions of this inquiry constitute the main issues addressed by the use of ontologies for multimedia indexing, namely: (1) the knowledge management and evolution, (2) the ability to handle uncertain knowledge and to deal with fuzzy semantics, and (3) the scalability and the ability to process a growing multimedia content volume with a continuous request for a better machine semantic interpretation capacities. What was accomplished in our study is a novel ontology management which is intended to a machine-driven knowledge database construction. 
Such a method could enable semantic improvements in large-scale multimedia content analysis and indexing. In order to illustrate the semantic enhancement of concept detection introduced by our proposed scalable and generic ontology-based framework, we have conducted different ex- periments within three multimedia evaluation campaigns: TrecVid 2010 (within Semantic Indexing Task), ImageClef 2012 (within Photo Annotation and Retrieval Task), and Image- Clef 2015 (within Scalable Concept Image Annotation Task).
... An incremental upgrade to the VERGE system presented by Vrochidis et al. [Vrochidis et al. 2010] is proposed by Moumtzidou et al. Their system also incorporates the recording of users interactions, which is utilized in the interface to tune search results [Moumtzidou et al. 2011;Moumtzidou et al. 2012]. ...
... Supported Type of Interaction 1 A B C DM N Q S [Adams et al. 2012] x x x [Al-Hajri et al. 2013] x x [Aly et al. 2012] x x x [Azzopardi et al. 2012] x x x x [Bailer et al. 2014] x x x x [Brachmann and Malaka 2009] x x [Chaisorn et al. 2010] x x x x x [Del Fabro et al. 2013] x x x [de Rooij et al. 2010] x x x [Friedland et al. 2009] x x x x [Girgensohn et al. 2011] x x [Jackson et al. 2013] x x x [Le et al. 2012] x x x x [Little et al. 2012] x x x [Lokoc et al. 2014] x x [Luan et al. 2011] x x [Matejka et al. 2012] x [Matejka et al. 2013] x [McGuinness et al. 2011] x x x x [Moumtzidou et al. 2011] x x [Moumtzidou et al. 2012] x x [Moumtzidou et al. 2014] x x x x x x x [Neng and Chambel 2010] x x [Palotai et al. 2014] x x x [Pavel et al. 2014] x x x x x [Pongnumkul et al. 2010] x [Schoeffmann and Boeszoermenyi 2011] x x x [Scott et al. 2014] x x x x [Sjöoberg et al. 2010] x x x x x [Ventura et al. 2012] x x [Viaud et al. 2010] x x x [Vrochidis et al. 2010] x x [Xu et al. 2014] x x x [Yuan et al. 2012] x x x ...
Full-text available
Digital video enables manifold ways of multimedia content interaction. Over the last decade, many proposals for improving and enhancing video content interaction were published. More recent work particularly leverages on highly capable devices such as smartphones and tablets that embrace novel interaction paradigms, e.g. touch, gesture-based or physical content interaction. In this paper, we survey literature at the intersection of Human-Computer Interaction and Multimedia. We integrate literature from video browsing and navigation, direct video manipulation, video content visualization, as well as interactive video summariza-tion and interactive video retrieval. We classify the reviewed works by the underlying interaction method and discuss the achieved improvements so far. We also depict a set of open problems that the video interaction community should address in the next years.
... It also defines a space of visual similarity, a space of semantic similarity, a semantic thread space and browsers to exploit these spaces. It is worth mentioning that the VERGE approach [7] supports the following functions: (i) a high level of visual conceptual retrieval and (ii) a visual retrieval. This tool combines indexing, analysis and recovery techniques of diverse modalities (textual, visual and conceptual). ...
Conference Paper
Following technological advances carried out recently, there has been an explosion in the quantity of videos available and their accessibility. This is largely justified by the fall of the prices of acquisition and the increase of the capacity of the memory supports, which made the storage of the large document video in computer system possible. To allow an effective exploitation of the collections, it is necessary to install tools facilitating the access to the documents and handle them. In this context, we propose a multimedia retrieval approach that puts the user at the center of the retrieval process starting from a text query. The new aspects of our proposal is as follows: (i) concerning the indexation part, we propose a new approach allowing a multilevel and semantic classification of videos, (ii) regarding the retrieval part, the inclusion of query expansion mechanism helps the user to formulate the query and the relevance feedback mechanism which helps improve the results considering the user's feedback. Our contribution at the experimental level consists in the implementation of prototype VISEN. In fact the technique proposed have been integrated in system seeks by the contents to evaluate the contribution in terms of effectiveness and precision. After carrying out a set of tests on 2700 videos and 62838 images, the experimental results showed that the proposed algorithm performs well.
... It defines a space of visual similarity, a space of semantic similarity, a space of semantic "thread" and browsers to exploit these spaces. Let us also mention the VERGE approach [5], which supports the following functions: search by high-level visual concept and textual search. This tool combines indexing, analysis and recovery techniques in a variety of ways (textual, visual and conceptual). ...
Full-text available
Videos clips became the most important and prominent multimedia document to illustrate the rituals process of Hajj and Umrah. Therefore, it is necessary to develop a system to facilitate access to information related to the duties, the pillars, the stages and the prayers. In this paper present a new project accomplishing a search engine in a large video database enabling any pilgrims to get the information that he care about as fast, accurate. This project is based on two techniques: (a) the weighting method to determine the degree of affiliation of a video clip to a particular topic (b) organizing data using several layers.
... It defines a space of visual similarity, a space of semantic similarity, a space of semantic "thread" and browsers to exploit these spaces. Let us also mention the VERGE approach [5], which supports the following functions: search by high-level visual concept and textual search. This tool combines indexing, analysis and recovery techniques in a variety of ways (textual, visual and conceptual). ...
Conference Paper
Full-text available
Videos clips became the most important and prominent multimedia document to illustrate the rituals process of Hajj and Umrah. Therefore, it is necessary to develop a system to facilitate access to information related to the duties, the pillars, the stages and the prayers. In this paper present a new project accomplishing a search engine in a large video database enabling any pilgrims to get the information that he care about as fast, accurate. This project is based on two techniques: (a) the weighting method to determine the degree of affiliation of a video clip to a particular topic (b) organizing data using several layers.
This paper discusses MAC-REALM, a framework for extraction of syntactic and semantic content features and content modelling with either little or no user interaction. The framework integrates a four filter-plane strategy: a pre-processing plane that filters redundant data, a syntactic feature extraction plane that filters syntactic features, a semantic relationships analysis and linkage plane that filters the spatial and temporal relationships of content features, and finally a content modelling plane where the syntactic and semantic content features are integrated into a content model. Each of the four planes is split into three layers: the content layer, where the content to be processed is stored; the application layer, where the content is converted into content descriptions; and the MPEG-7 layer, where content descriptions are serialized. Using MPEG-7 standards to produce the content model will provide wide-ranging interoperability while facilitating granular multi-content-type searches. MAC-REALM aims at ‘bridging’ the semantic gap, by integrating the syntactic and semantic content features from extraction through to modelling.
Overwhelming amounts of surveillance video data are steadily increasing the pressure for efficient content-based retrieval and other applications. However, a semantic gap exists between low-level visual signal processing and high-level semantic understanding of video events. In this paper, we propose an ontology-based content archiving and retrieval framework for surveillance videos. Unlike generalized multimedia ontology frameworks, a surveillance domain ontology is first designed as the content description schema, based on which video data is analyzed to form description files in the Web Ontology Language (OWL). A web-based semantic retrieval engine, compatible with the OWL query API, is then developed to provide an indexing service. Case studies of "walking people" and "car parking" demonstrate that the proposed framework can generate an OWL description of a video clip and, conversely, locate the information efficiently.
The increase in availability and usage of online digital video has created a need for automated video content analysis techniques, including indexing and retrieval. Automating indexing significantly reduces processing cost while minimizing tedious work. Traditional video retrieval methods based on video metadata fail to meet the technical challenges posed by the large and rapid growth of multimedia data, demanding effective retrieval systems. One of the most popular solutions for indexing is extracting features from video key frames to develop a Content-Based Video Retrieval (CBVR) system. CBVR works more effectively because it deals with the content of the video rather than its metadata. Various features such as color, texture and shape can be integrated and used for video indexing and retrieval. The implemented CBVR system is evaluated on an integration of texture, color and edge features for video retrieval. Entropy is used as a texture descriptor for key-frame extraction and video indexing, while entropy, color (RGB) and edge-detection algorithms are used for retrieval. These features are combined in various ways, such as entropy-edge and entropy-color, for result refinement. The dataset comprises videos from different domains such as e-learning, nature and construction. By combining these features in different ways, we achieved comparative results. The obtained results show that combining two or more features gives better retrieval.
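The entropy-based key-frame selection described above can be sketched as follows. This is an illustrative sketch only, not that paper's implementation; the 0.5-bit threshold and the flat-pixel-list frame representation are assumptions made for the example:

```python
import math

def entropy(gray_pixels):
    """Shannon entropy (in bits) of an 8-bit grayscale frame,
    given as a flat list of pixel values in 0..255."""
    hist = [0] * 256
    for p in gray_pixels:
        hist[p] += 1
    n = len(gray_pixels)
    return -sum((c / n) * math.log2(c / n) for c in hist if c)

def select_key_frames(frames, threshold=0.5):
    """Keep a frame as a key frame whenever its entropy differs from the
    previous key frame's entropy by more than `threshold` bits."""
    key_indices, last = [], None
    for i, frame in enumerate(frames):
        h = entropy(frame)
        if last is None or abs(h - last) > threshold:
            key_indices.append(i)
            last = h
    return key_indices
```

A uniform frame has zero entropy, so a texture change (e.g. a scene cut) shows up as an entropy jump that triggers a new key frame.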
This paper provides an overview of the tasks submitted to TRECVID 2012 by ITI-CERTH. ITI-CERTH participated in the Known-Item Search (KIS) and Semantic Indexing (SIN) tasks, as well as in the Event Detection in Internet Multimedia (MED) and Multimedia Event Recounting (MER) tasks. In the SIN task, techniques are developed which combine video representations that express motion semantics with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation. In the MED task, two methods are evaluated: one based on Gaussian mixture models (GMM) and audio features, and a "semantic model vector" approach that combines a pool of subclass kernel support vector machines (KSVMs) in an ECOC framework for event detection exploiting visual information only. Furthermore, we investigate fusion strategies for the two systems at an intermediate semantic level or at score level (late fusion). In the MER task, a "model vector" approach is used to describe the semantic content of the videos, similar to the MED task, and a novel feature selection method is utilized to select the most discriminant concepts regarding the target event. Finally, the KIS task is performed by employing VERGE, which is an interactive retrieval application combining retrieval functionalities in various modalities.
This paper provides an overview of the tasks submitted to TRECVID 2009 by ITI-CERTH. ITI-CERTH participated in the high-level feature extraction task and the search task. In the high-level feature extraction task, techniques are developed that combine motion information with existing well-performing descriptors such as SIFT and Bag-of-Words for shot representation. In a separate run, the use of compressed video information to form a Bag-of-Words model for shot representation is studied. The search task is based on an interactive retrieval application combining retrieval functionalities in various modalities (i.e. textual, visual and concept search) with a user interface supporting interactive search over all queries submitted. Evaluation results on the submitted runs for this task provide interesting conclusions regarding the comparison of the involved retrieval functionalities as well as the strategies in interactive video search.
In order to handle spatial data efficiently, as required in computer-aided design and geo-data applications, a database system needs an index mechanism that will help it retrieve data items quickly according to their spatial locations. However, traditional indexing methods are not well suited to data objects of non-zero size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree which meets this need, and give algorithms for searching and updating it. We present the results of a series of tests which indicate that the structure performs well, and conclude that it is useful for current database systems in spatial applications.
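The core R-tree search idea, descending only into subtrees whose minimum bounding rectangle (MBR) overlaps the query rectangle, can be sketched with a toy node structure. This is a simplification for illustration; the class name and layout are invented here, and the actual R-tree's insert and node-split algorithms are omitted:

```python
class RNode:
    """Toy R-tree node.  Each entry pairs an MBR (xmin, ymin, xmax, ymax)
    with either an object id (leaf node) or a child node (internal node)."""
    def __init__(self, entries, leaf):
        self.entries = entries  # list of (mbr, object_id_or_child)
        self.leaf = leaf

def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, query):
    """Report ids of all objects whose MBR intersects `query`,
    descending only into subtrees whose MBR overlaps the query."""
    hits = []
    for mbr, payload in node.entries:
        if intersects(mbr, query):
            if node.leaf:
                hits.append(payload)
            else:
                hits.extend(search(payload, query))
    return hits
```

The pruning in `search` is what makes the structure efficient: subtrees whose bounding rectangle misses the query window are never visited.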
with a preface by George Miller WordNet, an electronic lexical database, is considered to be the most important resource available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexicalized concept. Different relations link the synonym sets. The purpose of this volume is twofold. First, it discusses the design of WordNet and the theoretical motivations behind it. Second, it provides a survey of representative applications, including word sense identification, information retrieval, selectional preferences of verbs, and lexical chains. Contributors: Reem Al-Halimi, Robert C. Berwick, J. F. M. Burg, Martin Chodorow, Christiane Fellbaum, Joachim Grabowski, Sanda Harabagiu, Marti A. Hearst, Graeme Hirst, Douglas A. Jones, Rick Kazman, Karen T. Kohl, Shari Landes, Claudia Leacock, George A. Miller, Katherine J. Miller, Dan Moldovan, Naoyuki Nomura, Uta Priss, Philip Resnik, David St-Onge, Randee Tengi, Reind P. van de Riet, Ellen Voorhees.
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
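The stepwise suffix-stripping idea can be illustrated with a toy stemmer. To be clear, this is NOT the full Porter algorithm: the suffix table, replacements and minimum-stem-length rule below are invented for illustration, standing in for Porter's measure-based conditions:

```python
def strip_suffixes(word, min_stem=3):
    """Toy stepwise suffix stripper (illustrative only -- NOT the full
    Porter algorithm).  Complex suffixes fall away as a sequence of
    simple steps; a removal is kept only if the remaining stem is at
    least `min_stem` characters long."""
    steps = [
        ("ational", "ate"),   # relational -> relate
        ("ization", "ize"),   # realization -> realize
        ("ness", ""),         # darkness -> dark
        ("ing", ""),
        ("ed", ""),
        ("s", ""),
    ]
    for suffix, replacement in steps:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem:
                word = stem
    return word
```

The condition on the remaining stem mirrors the abstract's point that each removal "is made to depend upon the form of the remaining stem"; Porter's real condition uses a syllable-length measure rather than a character count.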
LIBSVM is a library for support vector machines (SVM). Its goal is to help users easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user-outlined object in a video. The object is represented by a set of viewpoint-invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation, where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full-length feature films.
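The text-retrieval analogy described above, with quantized descriptors as "visual words", posting lists, and weighted ranking, can be sketched minimally as follows. This is an illustrative reduction: the weighting here is plain idf, whereas the actual system uses richer term weighting and the region-tracking stage described in the abstract:

```python
import math
from collections import defaultdict, Counter

def build_inverted_index(shots):
    """shots maps shot_id -> list of visual-word ids (the result of
    vector-quantizing that shot's region descriptors).
    Returns posting lists and an idf weight per visual word."""
    index = defaultdict(set)
    for sid, words in shots.items():
        for w in words:
            index[w].add(sid)
    n = len(shots)
    idf = {w: math.log(n / len(postings)) for w, postings in index.items()}
    return index, idf

def rank_shots(query_words, index, idf):
    """Score each shot by the summed idf of the query's visual words it
    contains, and return shot ids best-first (a Google-style ranked list)."""
    scores = Counter()
    for w in set(query_words):
        for sid in index.get(w, ()):
            scores[sid] += idf[w]
    return [sid for sid, _ in scores.most_common()]
```

Because only the posting lists of the query's visual words are touched, retrieval cost depends on the query, not on the collection size, which is why the abstract describes retrieval as immediate.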
A common limitation of many retrieval models, including the recently proposed axiomatic approaches, is that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the queries and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. In this paper, we show that semantic term matching can be naturally incorporated into the axiomatic retrieval model through defining the primitive weighting function based on a semantic similarity function of terms. We define several desirable retrieval constraints for semantic term matching and use such constraints to extend the axiomatic model to directly support semantic term matching based on the mutual information of terms computed on some document set. We show that such extension can be efficiently implemented as query expansion. Experiment results on several representative data sets show that, with mutual information computed over the documents in either the target collection for retrieval or an external collection such as the Web, our semantic expansion consistently and substantially improves retrieval accuracy over the baseline axiomatic retrieval model. As a pseudo-feedback method, our method also outperforms a state-of-the-art language modeling feedback method.
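The mutual-information-driven expansion can be sketched as follows. This is an illustrative sketch using document-level pointwise mutual information, not that paper's axiomatic weighting functions; the function names and the choice of PMI estimator are assumptions made for the example:

```python
import math
from collections import Counter
from itertools import combinations

def pairwise_mi(docs):
    """Pointwise mutual information of term pairs, estimated from
    document-level co-occurrence over the collection `docs`
    (each doc is a list of terms)."""
    n = len(docs)
    df, co = Counter(), Counter()
    for doc in docs:
        terms = set(doc)
        df.update(terms)                 # document frequency per term
        co.update(frozenset(p) for p in combinations(sorted(terms), 2))
    mi = {}
    for pair, c in co.items():
        a, b = tuple(pair)
        # log of observed joint probability over the independence baseline
        mi[pair] = math.log((c / n) / ((df[a] / n) * (df[b] / n)))
    return mi

def expand_query(query_terms, mi, k=2):
    """Append the k non-query terms with the highest MI to any query term."""
    qset = set(query_terms)
    best = {}
    for pair, score in mi.items():
        a, b = tuple(pair)
        for term, other in ((a, b), (b, a)):
            if other in qset and term not in qset:
                if term not in best or score > best[term]:
                    best[term] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return list(query_terms) + ranked[:k]
```

As in the abstract, the MI statistics can be computed over the target collection itself or over an external collection, with the expansion applied at query time.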