The VISIONE Video Search System: Exploiting Off-the-Shelf
Text Search Engines for Large-Scale Video Retrieval
Giuseppe Amato , Paolo Bolettieri , Fabio Carrara , Franca Debole , Fabrizio Falchi , Claudio Gennaro ,
Lucia Vadicamo * and Claudio Vairo
Citation: Amato, G.; Bolettieri, P.; Carrara, F.; Debole, F.; Falchi, F.; Gennaro, C.; Vadicamo, L.; Vairo, C. The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval. Preprints 2021, 7, 76.
Academic Editor: Gonzalo Pajares
Received: 22 March 2021; Accepted: 20 April 2021; Published: 23 April 2021
Institute of Information Science and Technologies (ISTI), Italian National Research Council (CNR),
Via G. Moruzzi 1, 56124 Pisa, Italy; giuseppe.amato@isti.cnr.it (G.A.); paolo.bolettieri@isti.cnr.it (P.B.);
fabio.carrara@isti.cnr.it (F.C.); franca.debole@isti.cnr.it (F.D.); fabrizio.falchi@isti.cnr.it (F.F.);
claudio.gennaro@isti.cnr.it (C.G.); claudio.vairo@isti.cnr.it (C.V.)
*Correspondence: lucia.vadicamo@isti.cnr.it
Abstract: This paper describes in detail VISIONE, a video search system that allows users to search
for videos using textual keywords, the occurrence of objects and their spatial relationships, the
occurrence of colors and their spatial relationships, and image similarity. These modalities can be
combined together to express complex queries and meet users’ needs. The peculiarity of our approach
is that we encode all information extracted from the keyframes, such as visual deep features, tags,
color and object locations, using a convenient textual encoding that is indexed in a single text retrieval
engine. This offers great flexibility when results corresponding to various parts of the query (visual,
text and locations) need to be merged. In addition, we report an extensive analysis of the retrieval
performance of the system, using the query logs generated during the Video Browser Showdown
(VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters
and strategies from those we tested.
Keywords: content-based video retrieval; surrogate text representation; known item search; ad-hoc video search; multimedia and multimodal retrieval; multimedia information systems; information systems applications; video search; image search; users and interactive retrieval; retrieval models and ranking
1. Introduction
With the pervasive use of digital cameras and social media platforms, we witness
a massive daily production of multimedia content, especially videos and photos. This
phenomenon poses several challenges for the management and retrieval of visual archives.
On one hand, the use of content-based retrieval systems and automatic data analysis is
crucial to deal with visual data that typically are poorly-annotated (think for instance of
user-generated content). On the other hand, there is an increasing need for scalable systems
and algorithms to handle ever-larger collections of data.
In this work, we present a video search system, named VISIONE, which provides
users with various functionalities to easily search for targeted videos. It relies on artificial
intelligence techniques to automatically analyze and annotate visual content and employs
an efficient and scalable search engine to index and search for videos. A demo of VISIONE
running on the V3C1 dataset, described in the following, is publicly available at (http://visione.isti.cnr.it/, accessed on 22 April 2021).
VISIONE participated in the Video Browser Showdown (VBS) 2019 challenge [ ]. VBS is an international video search competition [ ] that evaluates the performance of interactive video retrieval systems. Performed annually since 2012, it is becoming increasingly challenging as its video archive grows and new query tasks are introduced in the competition. The V3C1 dataset [ ] used in the competition since 2019 consists of 7475 videos gathered from the web, for a total of about 1000 hours. The V3C1 dataset is segmented into 1,082,657 non-overlapping video segments, based on the visual content of the videos [ ]. The shot segmentation for each video as well as the keyframes and thumbnails per video segment are available within the dataset (https://www-nlpir.nist.gov/projects/tv2019/data.html, accessed on 22 April 2021). In our work, we used the video
segmentation and the keyframes provided with the V3C1 dataset. The tasks evaluated
during the competition are: Known-Item-Search (KIS), textual KIS, and Ad-hoc Video Search (AVS). Figure 1 gives an example of each task. The KIS task models the situation
in which a user wants to find a particular video clip that he or she has previously seen,
assuming that it is contained in a specific data collection. The textual KIS is a variation of
the KIS task, where the target video clip is no longer visually presented to the participants
of the challenge, but it is rather described in detail by some text. This task simulates
situations in which a user wants to find a particular video clip, without having seen
it before, but knowing exactly its content. For the AVS task, instead, a general textual
description is provided and participants have to find as many correct examples as possible,
i.e., video shots that match the given description.
Figure 1. Examples of KIS, textual KIS and AVS tasks. In KIS, a 20-second video sequence is shown, and any keyframe of the showed video sequence is a correct result; in textual KIS, the target clip is described in text only (e.g., "A slow pan up from a canyon, static shots of a bridge and red rock mountain. A river is visible at the ground of the canyon. The bridge is a steel bridge, there is a road right to the mountain in the last shot."); in AVS, a general description (e.g., "A person jumping with a bike") admits several examples of correct results.
VISIONE can be used to solve both KIS and AVS tasks. It integrates several content-
based data analysis and retrieval modules, including a keyword search, a spatial object-
based search, a spatial color-based search, and a visual similarity search. The main novelty
of our system is that it employs text encodings that we specifically designed for indexing
and searching video content. This aspect of our system is crucial: we can exploit the
latest text search engine technologies, which nowadays are characterized by high efficiency
and scalability, without the need to define a dedicated data structure or even worry
about implementation issues like software maintenance or updates to new hardware
technologies, etc.
In [ ], we initially introduced VISIONE by only listing its functionalities and briefly
outlining the techniques it employs. In this work, instead, we have two main goals: first,
to provide a more detailed description of all the functionalities included in VISIONE and
how each of them is implemented and combined together; second, to present an analysis
of the system retrieval performance by examining the logs acquired during the VBS2019
challenge. Therefore, this manuscript primarily presents how all the aforementioned search
functionalities are implemented and integrated into a unified framework that is based on a
full-text search engine, such as Apache Lucene (https://lucene.apache.org/, accessed on
22 April 2021); secondly, it presents an experimental analysis for identifying the most
suitable text scoring function (ranker) for the proposed textual encoding in the context of
video search.
The rest of the paper is organized as follows. The next section reviews related works.
Section 3 gives an overview of our system and its functionalities. Key notions on our proposed textual encoding and other aspects regarding the indexing and search phases are presented in Section 4. Section 5 presents an experimental evaluation to determine which text scoring function is the best in the context of a KIS task. Section 6 draws the conclusions.
2. Related Work
Video search is a challenging problem of great interest in the multimedia retrieval
community. It employs various information retrieval and extraction techniques, such as
content-based image and text retrieval, computer vision, speech and sound recognition,
and so on.
In this context, several approaches for cross-modal retrieval between visual data and text descriptions have been proposed, such as [ ], to name but a few. Many of them are image-text retrieval methods that make use of a projection of the image features and the text features into the same space (visual, textual, or a joint space), so that the retrieval is then performed by searching in this latent space (e.g., [ ]). Other approaches are referred to as video-text retrieval methods, as they learn embeddings of video and text in the same space by using different multi-modal features (like visual cues, video dynamics, audio inputs, and text) [ ]. For example, Ref. [ ] simultaneously utilizes multi-modal features to learn two joint video-text embedding networks: one learns a joint space between text features and visual appearance features, the other learns a joint space between text features and a combination of activity and audio features.
Many attempts to develop effective visual retrieval systems have been made since the 1990s, such as the content-based querying system for video databases [ ] or the query-by-image-content system presented in [ ]. Many video retrieval systems are
designed in order to support complex human generated queries that may include but are
not limited to keywords or natural language sentences. Most of them are interactive tools
where the users can dynamically refine their queries in order to better specify their search
intent during the search process. The VBS contest provides a live and fair performance
assessment of interactive video retrieval systems and therefore in recent years has become
a reference point for comparing state-of-the-art video search tools. During the competition,
the participants have to perform various KIS and AVS tasks in a limited amount of time
(generally within 5–8 min for each task). To evaluate the interactive search performance
of each video retrieval system, several search sessions are performed by involving both expert and novice users. Expert users are the developers of the retrieval system in the competition or people who already knew and used the system before the competition. Novices are users who interact with the search system for the first time during the competition.
Several video retrieval systems have participated in VBS in recent years [ ].
Most of them, including our system, support multimodal search with interactive query
formulation. The various systems differ mainly on (i) the search functionalities supported
(e.g., query-by-keyword, query-by-example, query-by-sketch, etc.), (ii) the data indexing
and search mechanisms used at the core of the system, (iii) the techniques employed
during video preprocessing to automatically annotate selected keyframes and extract
image features, (iv) the functionalities integrated into the user interface, including advanced
visualization and relevance feedback. Among all the systems that participated in VBS, we
recall VIRET [ ], vitrivr [ ], and SOM-Hunter [ ], which won the competition in 2018, 2019, and 2020, respectively.
VIRET [ ] is an interactive frame-based video retrieval system that currently
provides four main retrieval modules (query by keyword, query by free-form text, query by color sketch, and query by example). The keyword search relies on automatic annotation
of video keyframes. In the latest versions of the system, the annotation is performed using
a retrained deep Convolutional Neural Network (NasNet [ ]) with a custom set of 1243
class labels. A retrained NasNet is also used to extract deep features of the images, which
are then employed for similarity search. The free-form text search is implemented by
using a variant of the W2VV++ model [ ]. An interesting functionality supported by
VIRET is the temporal sequence search, which allows a user to describe more than one
frame of a target video sequence by also specifying the expected temporal ordering of the
searched frames.
Vitrivr [ ] is an open-source multimedia retrieval system that supports content-based
retrieval of several media types (images, audio, 3D data, and video). For video retrieval,
it offers different query modes, including query by sketch (both visual and semantic),
query by keywords (concept labels), object instance search, speech transcription search,
and similarity search. For the query by sketch and query by example, vitrivr uses several
low-level image features and a Deep Neural Network pixel-wise semantic annotator [ ]. The textual search is based on scene-wise descriptions, structured metadata, OCR, and ASR data extracted from the videos. Faster-RCNN [ ] (pre-trained on the Openimages V4
dataset) and a ResNet-50 [7] (pre-trained on ImageNet) are used to support object instance
search. The latest version of vitrivr also supports temporal queries.
SOM-Hunter [ ] is an open-source video retrieval system that supports keyword
search, free-text search, and temporal search functionalities, which are implemented as
in the VIRET system. The main novelty of SOM-Hunter is that it relies on the user’s rele-
vance feedback to dynamically update the search results displayed using self-organizing
maps (SOMs).
Our system, like almost all current video retrieval systems, relies on artificial intelli-
gence techniques for automatic video content analysis (including automatic annotation
and object recognition). Nowadays, content-based image retrieval systems (CBIR) are a possible solution to the problem of retrieving and exploring the large volumes of images resulting from the exponential growth of accessible image data. Many of these systems
use both visual and textual features of the images, but often most of the images are not
annotated or only partially annotated. Since manual annotation for a large volume of
images is impractical, Automatic Image Annotation (AIA) techniques aim to bridge this
gap. For the most part, AIA approaches are based solely on the visual features of the
image using different techniques: one of the most common approaches consists in training
a classifier for each concept and obtaining the annotation results by ranking the class
probability [ ]. There are other AIA approaches that aim to improve the quality of
image annotation by using the knowledge implicit in a large collection of unstructured text
describing images, and are able to label images without having to train a model (Unsupervised Image Annotation approaches [ ]). In particular, the image annotation technique we exploited is an Unsupervised Image Annotation technique originally introduced in [ ].
Recently, image features built upon Convolutional Neural Networks (CNN) have
been used as an effective alternative to descriptors built using image local features, like
SIFT, ORB and BRIEF, to name but a few. CNNs have been used to perform several tasks,
including image classification, as well as image retrieval [ ] and object detection [ ]. Moreover, it has been proved that the representations learned by CNNs on specific tasks (typically supervised) can be transferred successfully across tasks [ ]. The activation of neurons of specific layers, in particular the last ones, can be used as features to semantically describe the visual content of an image. Tolias et al. [ ] proposed the Regional Maximum
Activations of Convolutions (R-MAC) feature representation, which encodes and aggre-
gates several regions of the image in a dense and compact global image representation.
Gordo et al. [ ] inserted the R-MAC feature extractor in an end-to-end differentiable
pipeline in order to learn a representation optimized for visual instance retrieval through
back-propagation. The whole pipeline is composed of a fully convolutional neural network, a region proposal network, the R-MAC extractor and PCA-like dimensionality reduction layers, and it is trained using a ranking loss based on image triplets. In our work, as a feature extractor for video frames, we used a version of R-MAC that uses the ResNet-101 trained model provided by [ ] as its core. This model has proven to perform best on standard benchmarks.
Object detection and recognition techniques also provide valuable information for
semantic understanding of images and videos. In [ ], the authors proposed a model for
object detection and classification, which integrates Tensor features. The latter are invariant
under spatial transformation and together with SIFT features (which are invariant to scaling
and rotation) allow improving the classification accuracy of detected objects using a Deep
Neural Network. In [ ], the authors presented a cloud-based system that analyses
video streams for object detection and classification. The system is based on a scalable
and robust cloud computing platform for performing automated analysis of thousands of
recorded video streams. The framework requires a human operator to specify the analysis
criteria and the duration of video streams to analyze. The streams are then fetched from a
cloud storage, decoded and analyzed on the cloud. The framework executes intensive parts
of the analysis on GPU-based servers in the cloud. Recently, in [ ], the authors proposed
an approach that combines Deep CNN and SIFT. In particular, they extract features from
the analyzed images with both approaches, they fuse the features by using a serial-based
method that produces a matrix that is fed to an ensemble classifier for recognition.
In our system, we used YOLOv3 [ ] as the CNN architecture to recognize and locate
objects in the video frames. The architecture of YOLOv3 jointly performs a regression of
the bounding box coordinates and classification for every proposed region. Unlike other
techniques, YOLOv3 performs these tasks in an optimized fully-convolutional pipeline that
takes pixels as input and outputs both the bounding boxes and their respective proposed
categories. This CNN has the great advantage of being particularly fast and at the same
time exhibiting remarkable accuracy. To increase the number of categories of recognizable
objects, we used three different variants of the same network trained on different data sets,
namely, YOLOv3, YOLO9000 [52], and YOLOv3 OpenImages [53].
One of the main peculiarities of our system, compared to others participating in
VBS, is that we decided to employ a full-text search engine to index and search video
content, both for the visual and textual parts. Since modern text search technologies achieve impressive performance in terms of scalability and efficiency, VISIONE turns out to be scalable as well. To take full advantage of these stable search engine technologies, we
specifically designed various text encodings for all the features and descriptors extracted
from the video keyframes and the user query, and we decided to use the Apache Lucene
project. In previous papers, we already exploited the idea of using a text encoding, named Surrogate Text Representation [ ], to index and search images by means of deep features [ ]. In VISIONE, we extend this idea to also index information regarding the position of objects and colors that appear in the images.
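To make the Surrogate Text Representation idea concrete, the following is a minimal, illustrative sketch (not VISIONE's actual encoding, whose details are given in Section 4): each component of a non-negative feature vector is quantized to an integer and emitted as a synthetic term repeated that many times, so that a text engine's term-frequency scoring approximates the inner product between vectors.

```python
from collections import Counter

def to_surrogate_text(features, scale=10):
    """Quantize a non-negative feature vector into a space-separated string
    of synthetic terms ('f0', 'f1', ...); term frequency encodes magnitude.
    The term naming and the scale factor are illustrative assumptions."""
    terms = []
    for i, value in enumerate(features):
        tf = int(round(value * scale))  # quantized term frequency
        terms.extend([f"f{i}"] * tf)
    return " ".join(terms)

def dot_from_texts(doc_text, query_text):
    """A TF-only inner product computed from two surrogate texts,
    mimicking what a term-frequency text scorer would approximate."""
    d, q = Counter(doc_text.split()), Counter(query_text.split())
    return sum(d[t] * q[t] for t in q)
```

For example, `to_surrogate_text([0.5, 0.2])` yields `"f0 f0 f0 f0 f0 f1 f1"`, and matching it against the surrogate text of `[0.4, 0.1]` gives a score proportional to the dot product of the two vectors, which is what lets an off-the-shelf text engine rank by visual similarity.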
3. The VISIONE Video Search Tool
VISIONE is a visual content-based retrieval system designed to support large scale
video search. It allows a user to search for a video describing the content of a scene by
formulating textual or visual queries (see Figure 2).
VISIONE, in fact, integrates several search functionalities and exploits deep learning
technologies to mitigate the semantic gap between text and image. Specifically, it supports:
query by keywords: the user can specify keywords including scenes, places or concepts (e.g., outdoor, building, sport) to search for video scenes;
query by object location: the user can draw on a canvas some simple diagrams to specify
the objects that appear in a target scene and their spatial locations;
query by color location: the user can specify some colors present in a target scene and
their spatial locations (similarly to object location above);
query by visual example: an image can be used as a query to retrieve video scenes that
are visually similar to it.
Moreover, the search results can be filtered by indicating whether the keyframes are
in color or in b/w, or by specifying their aspect ratio.
Figure 2. A screenshot of the VISIONE User Interface composed of two parts: the search and the browsing.
3.1. The User Interface
The VISIONE user interface is designed to be simple, intuitive, and easy to use, even for users who interact with it for the first time. As shown in the screenshot in
Figure 2, it integrates the searching and the browsing functionalities in the same window.
The searching part of the interface (Figure 3) provides:
a text box, named “Scene tags”, where the user can type keywords describing the target scene (e.g., “park sunset tree walk”);
a color palette and an object palette that can be used to easily drag & drop a desired color or object on the canvas (see below);
a canvas, where the user can sketch objects and colors that appear in the target scene simply by drawing bounding-boxes that approximately indicate the positions of the desired objects and colors (both selected from the palettes above) in the scene;
a text box, named “Max obj. number”, where the user can specify the maximum number of instances of the objects appearing in the target scene (e.g., two glasses);
two checkboxes where the user can filter the type of keyframes to be retrieved (B/W or color images, 4:3 or 16:9 aspect ratio).
The canvas is split into a grid of 7 × 7 cells, where the user can draw the boxes and then move, enlarge, reduce or delete them to refine the search. The user can select the desired color from the palette, drag & drop it on the canvas, and then resize or move the corresponding box as desired. There are two options to insert objects in the canvas: (i) directly draw a box in the canvas using the mouse and then type the name of the object in a dialog box (auto-complete suggestions are shown to the user), (ii) drag & drop one of the object icons appearing in the object palette onto the canvas. For the user's convenience, a selection of 38 common (frequently used) objects is included in the object palette.
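A box drawn on the 7 × 7 canvas can be reduced to the set of grid cells it overlaps, each rendered as a textual token, so that spatial queries become ordinary term matching in the text engine. The sketch below is illustrative only: the token format (`row_col_label`) and function name are assumptions, not VISIONE's actual encoding.

```python
GRID = 7  # the canvas is split into a 7x7 grid of cells

def bbox_to_tokens(label, left, top, right, bottom):
    """Map a bounding box with normalized [0, 1] coordinates to textual
    tokens, one per grid cell it overlaps. A keyframe whose detected
    objects produce the same tokens matches the query term-wise."""
    c0, r0 = int(left * GRID), int(top * GRID)
    # subtract a tiny epsilon so a box ending exactly on a cell border
    # does not spill into the next cell
    c1 = min(int(right * GRID - 1e-9), GRID - 1)
    r1 = min(int(bottom * GRID - 1e-9), GRID - 1)
    return [f"{r}_{c}_{label}"
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]
```

For instance, a "person" box covering the top-left quarter of the canvas produces tokens for a 4 × 4 block of cells, while a box covering the whole canvas produces one token per cell.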
Figure 3. An example of how to use the VISIONE Search Interface to find the target scene: search for images that contain a woman in a light blue t-shirt in the foreground and a man to her right (query by objects/colors), images labeled with “music, bar, cafe, dinner” (query by keywords), or images similar to some others (query by example).
Note that when objects are inserted in the canvas (e.g., a “person” and a “car”), then
the system filters out all the images not containing the specified objects (e.g., all the scenes
without a person or without a car). However, images with multiple instances of those
objects can be returned in the search results (e.g., images with two or three people and one
or more cars). The user can use the “Max obj. number” text box to specify the maximum
number of instances of an object appearing in the target scene. For example by typing “1
person 3 car 0 dog” the system returns only images containing at most one person, three cars
and no dog.
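The “Max obj. number” filter can be read as a list of per-class caps. A possible parser and filter are sketched below; the function names and the exact matching semantics are our illustrative assumptions, not VISIONE's implementation.

```python
def parse_max_obj(text):
    """Parse strings like '1 person 3 car 0 dog' into {class: max_count},
    pairing each count token with the label token that follows it."""
    tokens = text.split()
    return {label: int(count)
            for count, label in zip(tokens[::2], tokens[1::2])}

def satisfies(detected, caps):
    """True if the per-class object counts detected in a keyframe
    respect every cap; a cap of 0 excludes the class entirely."""
    return all(detected.get(label, 0) <= n for label, n in caps.items())
```

With caps parsed from “1 person 3 car 0 dog”, a keyframe with one person and two cars passes, while a keyframe with two people, or with any dog, is filtered out.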
The “Scene tags” text box provides auto-complete suggestions to the users and, for each tag, it also indicates the number of keyframes in the database that are annotated with it. For example, by typing “music” the system suggests “music (204775); musician (1374); music hall (290); . . .”, where the numbers indicate how many images in the database are annotated with the corresponding text (e.g., 204775 images for “music”, 1374 images for “musician”, etcetera). This information can be exploited by the user when formulating the queries. Moreover, the keyword-based search supports wildcard matching. For example, with “music*” the system searches for any tag that starts with “music”.
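The suggestion behavior described above amounts to a prefix lookup over the tag vocabulary, with suggestions ordered by annotation frequency. A toy sketch (function name and data layout are hypothetical; the counts below are the ones quoted in the text):

```python
def suggest(prefix, tag_counts, limit=3):
    """Return autocomplete suggestions for a tag prefix as
    (tag, number_of_annotated_keyframes) pairs, most frequent first."""
    matches = [(tag, n) for tag, n in tag_counts.items()
               if tag.startswith(prefix)]
    return sorted(matches, key=lambda pair: -pair[1])[:limit]
```

For example, with `tag_counts = {"music": 204775, "musician": 1374, "music hall": 290, "park": 10}`, `suggest("music", tag_counts)` returns the three “music…” tags in decreasing frequency, matching the suggestion list shown to the user.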
Every time the user interacts with the search interface (e.g., type some text or
add/move/delete a bounding box) the system automatically updates the list of search
results, which are displayed in the browsing interface, immediately below the search panel.
In this way the user can interact with the system and gradually compose his query by also
taking into account the search results obtained so far to refine the query itself.
The browsing part of the user interface (Figure 4) allows accessing the information associated with the video to which each displayed keyframe belongs, viewing a keyframe-based video summary, and playing the video starting from the selected keyframe. In this way, the user
can easily check if the selected image belongs to the searched video. The search results
can also be grouped together when the keyframes belong to the same video. This visualization option can be enabled/disabled by clicking on the “Group by video” checkbox. Moreover, while browsing the results, the user can use one of the displayed
images to perform an image Similarity Search and retrieve frames visually similar to the
one selected. A Similarity Search is executed by double clicking on an image displayed in
the search results.
Figure 4. A highlight of the Browsing Interface: for each keyframe result, it allows accessing information such as the video summary and keyframe context, playing the video starting from the selected keyframe, and searching for similar keyframes. The submission button was used during the Video Browser Showdown.
3.2. System Architecture Overview
The general architecture of our system is illustrated in Figure 5. Each component of
the system will be described in detail in the following sections; here we give an overview of
how it works. To support the search functionalities introduced above, our system exploits
deep learning technologies to understand and represent the visual content of the database
videos. Specifically, it employs:
an image annotation engine, to extract scene tags (see Section 4.1);
state-of-the-art object detectors, like YOLO [
], to identify and localize objects in the
video keyframes (see Section 4.2);
spatial colors histograms, to identify dominant colors and their locations (see Sec-
tion 4.2);
the R-MAC deep visual descriptors, to support the Similarity Search functionality (see
Section 4.3)
The peculiarity of the approach used in VISIONE is to represent all the different types
of descriptors extracted from the keyframes (visual features, scene tags, colors/object
locations) with a textual encoding that is indexed in a single text search engine. This choice
allows us to exploit mature and scalable full-text search technologies and platforms for
indexing and searching video repositories. In particular, VISIONE relies on the Apache
Lucene full-text search engine. The text encoding used to represent the various types of
information, associated with every keyframe, is discussed in Section 4.
The queries formulated by the user through the search interface (e.g., the keywords describing the target scene and/or the diagrams depicting object and color locations) are also transformed into textual encodings in order to be processed. We designed a specific textual encoding for each typology of data descriptor as well as for the user queries.
Figure 5. System Architecture: a general overview of the components of the two main phases of the system, the indexing and the searching.
In the full-text search engine, the information extracted from every keyframe is com-
posed of four textual fields, as shown in Figure 5:
Scene Tags, containing the automatically associated tags;
Object&Color BBoxes, containing the text encoding of color and object locations;
Object&Color Classes, containing global information on the objects and colors in the keyframe;
Visual Features, containing the text encoding of the extracted visual features.
These four fields are used to serve the four main search operations of our system:
Annotation Search, which searches for keyframes associated with specified annotations;
BBox Search, which searches for keyframes having specific spatial relationships among objects and colors;
OClass Search, which searches for keyframes containing specified objects/colors;
Similarity Search, which searches for keyframes visually similar to a query image.
The user query is broken down into three sub-queries (the first three search operations above), and a query rescorer (the Lucene QueryRescorer implementation, in our case) is used to combine the search results of all the sub-queries. Note that the Similarity Search is the only search operation that is stand-alone in our system: it is a functionality used only in the browsing phase. In the next section, we will describe the four search operations and further details on the indexing and searching phases.
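As a rough mental model of the combination step (VISIONE actually delegates this to Lucene's QueryRescorer), the results of the sub-queries can be fused by a weighted sum of per-sub-query normalized scores. The weights and max-normalization below are illustrative assumptions, not the tuned parameters discussed later in the paper.

```python
def fuse_scores(subquery_results, weights):
    """subquery_results: {subquery_name: {keyframe_id: score}}.
    Normalizes each sub-query's scores by its maximum, sums them with
    per-sub-query weights, and returns keyframe ids ranked best-first."""
    fused = {}
    for name, results in subquery_results.items():
        if not results:
            continue
        top = max(results.values()) or 1.0  # avoid division by zero
        for keyframe, score in results.items():
            fused[keyframe] = (fused.get(keyframe, 0.0)
                               + weights.get(name, 1.0) * score / top)
    return sorted(fused, key=fused.get, reverse=True)
```

A keyframe matched by several sub-queries (e.g., both the tag and the bounding-box parts of the query) accumulates contributions from each and is thus ranked above keyframes matched by a single sub-query.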
4. Indexing and Searching Implementation
In VISIONE, as already anticipated, the content of keyframes is represented and indexed
using automatically generated annotations, positions of occurring objects, positions of
colors, and deep visual features. In the following we describe how these descriptors are
extracted, indexed, and searched.
4.1. Image Annotation
One of the most natural ways of searching in a large multimedia data set is using
a keyword-based query. To support such kind of queries, we employed our automatic
annotation system (Demo available at http://mifile.deepfeatures.org, accessed on 22 April
2021) that was introduced in [ ]. This system is based on an unsupervised image annotation
approach that exploits the knowledge implicitly existing in a huge collection of unstruc-
tured texts describing images, allowing us to annotate the images without using a specific trained model. The advantage is that the target vocabulary we used for the annotation
reflects well the way people actually describe their pictures. Specifically, our system uses
the tags and the descriptions contained in the metadata of a large set of media selected
from the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset [ ]. Those tags are validated using WordNet [ ], cleaned, and then used as the knowledge base for the
automatic annotation.
The subset of the YFCC100M dataset that we used for building the knowledge base
was selected by identifying images with relevant textual descriptions and tags. To this
scope, we used a metadata cleaning algorithm that leverages on the semantic similarities
between images. Its core idea is that if a tag is contained in the metadata of a group of very
similar images, then that tag is likely to be relevant for all these images. The similarity
between images was measured by means of visual deep features; specifically, we used
the output of the sixth layer of the neural network Hybrid-CNN (Publicly available in the
Caffe Model Zoo, http://github.com/BVLC/caffe/wiki/Model-Zoo, accessed on 22 April
2021) as visual descriptors. In a nutshell, the metadata cleaning algorithm is performed
by (1) creating an inverted index where each stem (extracted from the textual metadata)
is associated with a posting list of all the images containing that stem in their metadata;
(2) images of each posting list are clustered according to their visual similarity; (3) an
iterative algorithm is used to clean the clusters so that at the end of the process, each stem
is associated with a list of clusters of very similar images (For further details, please refer
to [39]).
As a result of our metadata cleaning algorithm, we selected about 16 thousand terms associated with about one million images. The set of deep features extracted from those images was then indexed using the MI-file index [ ] in order to allow us to access the data and perform similarity searches in a very efficient way.
The annotation engine is based on a k-NN classification algorithm. An image is
annotated with the most frequent tags associated with the most similar images in the
YFCC100M cleaned subset. The specific definition of the annotation algorithm is out of the
scope of this paper and we refer to [39] for further details.
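As an illustration, the k-NN tag-voting step can be sketched as follows (a simplified sketch, not our actual implementation; `knn_search` is a hypothetical stand-in for the similarity search performed over the MI-file index, and the similarity-weighted voting is an assumption):

```python
from collections import Counter

def annotate(query_feature, knn_search, k=100, n_tags=10):
    """Annotate an image with the most frequent tags among its k nearest neighbors.

    `knn_search(feature, k)` is assumed to return the k most visually similar
    images of the cleaned YFCC100M subset as (tags, similarity) pairs.
    """
    votes = Counter()
    for tags, similarity in knn_search(query_feature, k):
        for tag in tags:
            votes[tag] += similarity  # weight each vote by visual similarity
    # return the n_tags most voted tags with their accumulated scores
    return votes.most_common(n_tags)
```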
In Figure 6, we show an example of annotation obtained with our system. Please
note that our system also provides a relevance score to each tag associated with the image.
The bigger the score the more relevant the tag. We used our annotation system to label
the video keyframes of the V3C1 dataset. For each keyframe we produce a “tag textual
encoding” by concatenating all the tags associated with the images. In order to represent
the relevance of the associated tag, each tag is repeated a number of times equal to the
relevance score of the tag itself (the relevance of each tag is approximated to an integer
using the ceiling function). The ordering of the tags in the concatenation is not important
because what matters are the tag frequencies. In Figure 6, the box named Textual Document
shows an example of concatenation associated with a keyframe. The so generated textual
documents are then stored in a separate field of our index, which we referred to as Scene
Tag field (see Figure 5).
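The construction of the tag textual encoding just described can be sketched as follows (a minimal sketch; the input is assumed to be a list of (tag, relevance) pairs produced by the annotation engine):

```python
import math

def tag_textual_encoding(tags_with_scores):
    """Build the "tag textual encoding" of a keyframe: each tag is repeated
    ceil(relevance) times; the order of the tags is irrelevant, since only
    their frequencies matter for the full-text search."""
    words = []
    for tag, relevance in tags_with_scores:
        words.extend([tag] * math.ceil(relevance))
    return " ".join(words)
```

For example, `tag_textual_encoding([("beach", 1.60), ("sun", 4.81)])` repeats beach twice and sun five times, mirroring the textual document of Figure 6.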
Guessed tag: sunset Relevance: 35.25
Guessed tag: sunrise Relevance: 7.24
Guessed tag: sun Relevance: 4.81
Guessed tag: sky Relevance: 3.22
Guessed tag: sol Relevance: 3.19
Guessed tag: view Relevance: 2.43
Guessed tag: cloud Relevance: 2.42
Guessed tag: lake Relevance: 1.62
Guessed tag: mountain Relevance: 1.61
Guessed tag: dusk Relevance: 1.61
Guessed tag: landscape Relevance: 1.61
Guessed tag: mar Relevance: 1.60
Guessed tag: beach Relevance: 1.60
beach beach cloud cloud cloud dusk dusk lake lake
landscape landscape mar mar mountain mountain sky sky
sky sky sol sol sol sol sun sun sun sun sun sunrise sunrise
sunrise sunrise sunrise sunrise sunrise sunrise sunset
sunset sunset sunset sunset sunset sunset sunset sunset
sunset sunset sunset sunset sunset sunset sunset sunset
sunset sunset sunset sunset sunset sunset sunset sunset
sunset sunset sunset sunset sunset sunset sunset sunset
sunset sunset sunset view view view
Textual Document
Figure 6. Example of our image annotation and its representation as a single textual document. In the textual document, each tag is repeated a number of times equal to the least integer greater than or equal to the tag relevance (e.g., beach with relevance 1.60 is repeated 2 times).
Annotation Search
The annotations, generated as described above, can be used to retrieve videos, by typ-
ing keywords in the “Scene tags” text box of the user interface (see Figure 3). As already
anticipated in Section 3.2, we call Annotation Search this searching option. The Annotation
Search is executed performing a full-text search. As described in Section 5, during the VBS
competition the Best Matching 25 (BM25) similarity was used as a text scoring function.
The textual document used to query our index is created as a space-separated concatenation of all the keywords typed by the user in the “Scene tags” text box. For example, if the user specifies the keywords “music” “bar” “cafe” “dinner”, then the corresponding query document is simply “music bar cafe dinner”. Note that, as our system relies on Lucene, we support single and multiple character wildcard searches (e.g., “musi*”).
4.2. Objects and Colors
Information related to objects and colors in a keyframe is treated in a similar way in our system. Given a keyframe, we store both local and global information about objects
and colors contained in it. As we discussed in Section 3.2, the positions where objects and
colors occur are stored in the Object&Color BBoxes field; all objects and colors occurring in a
frame are stored in the Object&Color Classes field.
4.2.1. Objects
We used a combination of three different versions of YOLO to perform object detection, namely YOLOv3 [ ], YOLO9000 [ ], and YOLOv3 OpenImages [ ], to extend the number of detected objects. The idea of using YOLO to detect objects within video has already been exploited in VBS, e.g., by Truong et al. [ ]. The peculiarity of our approach is that we combine
and encode the spatial position of the detected objects in a single textual description of the
image. To obtain the spatial information, we use a grid of 7 × 7 cells overlaid on the image to determine where (over which cells) each object is located. In particular, each object detected in the image I is indexed using a specific textual encoding ENC = codloc codclass that puts together the location codloc = col row, with col ∈ {a, b, c, d, e, f, g} and row ∈ {1, 2, 3, 4, 5, 6, 7} identifying a cell of the grid, and the class codclass, i.e., the string corresponding to the detected object. The textual encoding of this information is created as follows. For each image, we have a space-separated concatenation of ENCs, one for each cell of the grid that contains (part of) the object: for example, for the image in Figure 7 the rightmost car is indexed with the encoding g3car g4car g5car . . . , where “car” is the codclass of the car object, located over cells g3, g4, and g5 (among others). This information is stored in the Object&Color BBoxes field
of the record associated with the keyframe. In addition to the position of objects, we also
maintain global information about the objects contained in a keyframe, in terms of number
of occurrences of each object detected in the image (see Figure 7). Occurrences of objects in
a keyframe are encoded by repeating the object class (codclass) as many times as the number of occurrences of the object itself. This information is stored using an encoding that composes the classes with a progressive occurrence index (codclass i, with i ranging over the occurrences). For example,
in Figure 7, YOLO detected 2 persons, 3 cars, which are also classified as vehicle by the
detector, and 1 horse, also classified as animal and mammal, and this results in the Object
Classes encoding as “person1 person2 vehicle1 vehicle2 vehicle3 car1 car2 car3 mammal1 horse1
animal1”. This information is stored in the Object&Color Classes field of the record associated
with the keyframe.
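Assuming each detection comes with a bounding box in relative image coordinates, the two encodings can be sketched as follows (a simplified sketch; the mapping of columns to the letters a–g and rows to the numbers 1–7 reflects the codloc format described above, but the exact geometry handling is an assumption):

```python
from collections import Counter

COLS = "abcdefg"  # column identifiers of the 7x7 grid

def cells_covered(box):
    """Return the codloc identifiers (e.g., 'a3') of the grid cells
    overlapped by a bounding box (x0, y0, x1, y1) in [0, 1] coordinates."""
    x0, y0, x1, y1 = box
    cells = []
    for c in range(7):
        for r in range(7):
            cx0, cy0 = c / 7, r / 7            # cell extent in relative coords
            cx1, cy1 = (c + 1) / 7, (r + 1) / 7
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                cells.append(f"{COLS[c]}{r + 1}")
    return cells

def encode_objects(detections):
    """detections: list of (class_name, box). Returns the textual encodings
    stored in the Object&Color BBoxes and Object&Color Classes fields."""
    bbox_tokens, counts = [], Counter()
    for cls, box in detections:
        bbox_tokens += [f"{cell}{cls}" for cell in cells_covered(box)]
        counts[cls] += 1
    # occurrences are encoded as class name plus a progressive index
    class_tokens = [f"{cls}{i}" for cls, n in counts.items() for i in range(1, n + 1)]
    return " ".join(bbox_tokens), " ".join(class_tokens)
```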
a3car a3car a3vehicle a3vehicle a4car a4vehicle b1person b2person b3person b3car b3vehicle b4car b4vehicle c1person
c1person c2person c2person c2horse c2mammal c2animal c3person c3person c3horse c3mammal c3animal c4horse
c4mammal c4animal c5horse c5mammal c5animal d2horse d2mammal d2animal d3horse d3mammal d3animal d4horse
d4mammal d4animal d5horse d5mammal d5animal e2horse e2mammal e2animal e3horse e3mammal e3animal e3car
e3vehicle e4horse e4mammal e4animal e4car e4vehicle e5horse e5mammal e5animal e5car e5vehicle f3car f3vehicle
f4car f4vehicle f5car f5vehicle g3car g3vehicle g4car g4vehicle g5car g5vehicle
Textual Encoding of Object Bounding Boxes
person1 person2 vehicle1 vehicle2 vehicle3 car1 car2 car3 mammal1 horse1 animal1
Textual Encoding of Object Classes
Figure 7. Example of our textual encoding for objects and their spatial locations (second box): the information that the cell a3 contains two cars is encoded as the concatenation a3car a3car. In addition to the positions, we also encode the number of occurrences of each object in the image (first box): the two persons are encoded as person1 person2.
4.2.2. Colors
To represent colors, we use a palette of 32 colors (https://lospec.com/palette-list,
accessed on 22 April 2021) which represents a good trade-off between the huge miscellany
of colors and simplicity of choice for the user at search time. For the creation of the color
textual encoding we used the same approach employed to encode the object classes and
locations, using the same grid of 7 × 7 cells. To assign the colors to each cell of the grid, we
used the following approach. We first evaluate the color of each pixel by using the CIELAB
color space. Then, we map the evaluated color of the pixel to our 32-colors palette. To do
so, we perform a k-NN similarity search between the evaluated color and our 32 palette colors to find the colors in our palette that best match the color of the current pixel. The metric used for this search is the Earth Mover’s Distance [ ]. We take into consideration the first two colors in the k-NN results. The first color is assigned to that pixel. We then compute the ratio between the scores of the two colors, and if it is greater than 0.5, we also assign the second color to that pixel. This is done to allow matching of very similar colors
during searching. We repeat this for each pixel of a cell in the grid and then we sum the
occurrences of each color of our palette for all the pixels in the cell. Finally, we assign to that
cell all the colors whose occurrence is greater than 7% of the number of pixels contained in
the cell, so more than one color may be assigned to a single cell. This redundancy helps reduce mismatches between the detected colors and the colors as perceived by the human eye.
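The per-cell color assignment can be sketched as follows (a simplified sketch; `nearest_palette_colors` is a hypothetical stand-in for the 2-NN search over the 32-color palette under the Earth Mover's Distance, and the interpretation of the score ratio as second/first similarity is an assumption):

```python
from collections import Counter

def assign_cell_colors(pixels, nearest_palette_colors, threshold=0.07):
    """Assign palette colors to one grid cell.

    `pixels` is the list of CIELAB pixel values of the cell;
    `nearest_palette_colors(pixel)` is assumed to return the two best-matching
    palette colors as (color_id, score) pairs, best first.
    """
    occurrences = Counter()
    for pixel in pixels:
        (c1, s1), (c2, s2) = nearest_palette_colors(pixel)
        occurrences[c1] += 1   # the best-matching color is always assigned
        if s2 / s1 > 0.5:      # a close runner-up is assigned as well
            occurrences[c2] += 1
    # keep every color covering more than 7% of the cell's pixels
    min_count = threshold * len(pixels)
    return [color for color, n in occurrences.items() if n > min_count]
```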
The colors assigned to all the 7 × 7 cells are then encoded into two textual documents,
one for the color locations and one for the global color information, using the same approach
employed to encode object classes and locations, and discussed in Section 4.2.1. Specifically,
the textual document associated with the color locations is obtained by concatenating textual encodings of the form codloc codclass, where codloc is the identifier of a cell and codclass is the identifier of a color assigned to the cell. This information is stored in the Object&Color BBoxes field. The textual document for the color classes is obtained by concatenating the text identifiers (codclass) of all the colors assigned to the image. This information is stored in the Object&Color Classes field of the record associated with the keyframe.
Object and Color Location Search
At run-time, the search functionalities for both the query by object location and the query by color location are implemented using two search operations: the bounding box search (BBox Search) and the object/color-class search (OClass Search).
The user can draw a bounding box in a specific position of the canvas and specify which object/color he/she wants to find in that position, or he/she can drag & drop a particular object/color from the palette in the user interface and resize the corresponding bounding box as desired (as shown in the “Query by object/colors” of Figure 3). All the bounding boxes present in the canvas, both related to objects and colors, are then converted into the two textual encodings described in Sections 4.2.1 and 4.2.2, respectively. As a matter of fact, the canvas of the user interface corresponds exactly to the 7 × 7 grid of cells already mentioned when describing the encodings; the query is therefore built as the concatenation of all the cells’ encodings (both objects and colors) together with their occurrences, computed as described above.
For the actual search phase, first an instance of the OClass Search operator is executed.
This operator tries to find a match between all the objects/colors represented in the canvas and the frames stored in the index that contain these objects/colors. In other words,
the textual encoding of the Object&Color Classes generated from the canvas is used as
query document for a full-text search on the Object&Color Classes field of our index. This
search operation produces a result set containing a subset of the dataset with all the
frames that match the objects/colors drawn by the user in the canvas. After this, the BBox
Search operator performs a rescoring of the result set by matching the textual encoding of
the Object and Color Bounding Boxes encoding of the query with all the corresponding
encodings in the index (stored in the Object&Color BBoxes field). The metric used in this case
during the VBS competition was BM25. After the execution of these two search operators, the frames that satisfy both searches, ordered by descending score, are shown in the browsing part of the user interface.
4.3. Deep Visual Features
VISIONE also supports content-based visual search functionality, i.e., it allows users to
retrieve keyframes visually similar to a query image given by example. In order to represent
and compare the visual content of the images, we use the Regional Maximum Activations of Convolutions (R-MAC) [ ], which is a state-of-the-art descriptor for image retrieval. The R-MAC descriptor effectively aggregates several local convolutional features (extracted at multiple positions and scales) into a dense and compact global image representation. We use the ResNet-101 trained model provided by Gordo et al. [ ] as an R-MAC feature extractor, since it achieved the best performance on standard benchmarks. The resulting descriptors are 2048-dimensional real-valued vectors.
To efficiently index the R-MAC descriptors, we transform the deep features into a
textual encoding suitable for being indexed by a standard full-text search engine. We used
the Scalar Quantization-based Surrogate Text representation, proposed in [ ], to transform the deep features into a textual encoding. The idea behind this approach is to map the real-valued vector components of the R-MAC descriptor into a (sparse) integer vector that acts as the term frequencies vector of a synthetic codebook. Then the integer vector is transformed into a text document by simply concatenating some synthetic codewords so that the term frequency of the i-th codeword is exactly the i-th element of the integer vector. For example, the four-dimensional integer vector (2, 1, 0, 1) is encoded with the text “τ1 τ1 τ2 τ4”, where {τ1, . . . , τ4} is a codebook of four synthetic alphanumeric terms.
The overall process used to transform an R-MAC descriptor into a textual encoding is summarized in Figure 8 (for simplicity, the R-MAC descriptor is depicted as a 10-dimensional vector). The mapping of the deep features into the term frequencies vectors
is designed (i) to preserve as much as possible the rankings, i.e., similar features should
be mapped into similar term frequencies vectors (for effectiveness) and (ii) to produce
sparse vectors, since each data object will be stored in as many posting lists as the non-zero
elements in its term frequencies vector (for efficiency). To this end, the deep features are
first centered using their mean and then rotated using a random orthogonal transformation.
The random orthogonal transformation is particularly useful to distribute the variance
over all the dimensions of the vector as it provides good balancing for high dimensional
vectors without the need to search for an optimal balancing transformation. In this way, we try to increase the cases where the dimensional components of the feature vectors have the same mean and variance, with mean equal to zero. Moreover, the used roto-translation preserves the rankings according to the dot-product (see [ ] for more details). Since search engines, like the one we used, rely on an inverted file to store the data, as a second step we have to sparsify the features. Sparsification guarantees the efficiency of these
indexes. To achieve this, the Scalar Quantization approach maintains the components above a certain threshold, by zeroing all the others and quantizing the non-zero elements to integer values. To deal with negative values, the Concatenated Rectified Linear Unit (CReLU) transformation [ ] is applied before the thresholding. Note that the CReLU simply makes an identical copy of the vector elements, negates it, concatenates the original vector and its negation, and then zeros out all the negative values. As the last operation, we apply the Surrogate Text Representation technique [ ] that transforms an integer vector into a textual
representation by associating each dimension of the vector with a unique alphanumeric keyword, and by concatenating those keywords into a textual sequence such that the i-th keyword is repeated a number of times equal to the i-th value of the vector. For example, for a given N-dimensional integer vector v = [v1, . . . , vN], we use a synthetic codebook of N terms {τ1, . . . , τN}, where each τi is a distinct alphanumeric word (e.g., τ1 = “A”, τ2 = “B”, etc.). The vector v is then transformed into the text

τ1 . . . τ1 (v1 times) ∪ . . . ∪ τN . . . τN (vN times),

where, by abuse of notation, we denote the space-separated concatenation of keywords with the union operator ∪.
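The whole transformation can be sketched as follows (a simplified numpy sketch with made-up threshold and quantization parameters; in the actual system the mean, the random rotation, and the thresholds are estimated on the real feature distribution):

```python
import numpy as np

def surrogate_text(feature, mean, rotation, threshold=0.1, scale=30):
    """Scalar Quantization-based Surrogate Text representation (sketch).

    feature:  descriptor (1-D float array), mean: dataset mean,
    rotation: random orthogonal matrix distributing the variance.
    """
    x = rotation @ (feature - mean)        # centering + random rotation
    x = np.concatenate([x, -x])            # CReLU: copy and negate...
    x = np.where(x > 0, x, 0.0)            # ...then zero out negatives
    x[x < threshold] = 0.0                 # sparsification by thresholding
    tf = np.floor(x * scale).astype(int)   # quantize to integer term frequencies
    # dimension i becomes a synthetic codeword "txt<i>" repeated tf[i] times
    return " ".join(" ".join([f"txt{i}"] * f) for i, f in enumerate(tf) if f > 0)
```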
Figure 8. Scalar Quantization-based Surrogate Text representation: clockwise, the transformation of the image R-MAC descriptor (depicted as a 10-dimensional vector) into a textual encoding called Surrogate Text Representation.
The advantage of scalar quantization-based surrogate texts is that we can transform a visual feature into a text document that can be indexed by classical text retrieval techniques for efficiency reasons, while keeping all the semantic power of the original visual feature. In VISIONE, the Surrogate Text Representation of a dataset image is stored in the “Visual Features” field of our index (Figure 5).
Similarity Search
VISIONE relies on the Surrogate text encodings of images to perform the Similarity
Search. When the user starts a Similarity Search by selecting a keyframe in the browsing
interface, the system retrieves all the indexed keyframes whose Surrogate Text Representation is similar to that of the selected keyframe. We used the dot product over the term frequencies vectors (TF ranker) as text similarity function, since it achieved very good performance for large-scale image retrieval tasks [56].
4.4. Overview of the Search Process
As we described so far, our system relies on four search operations: an Annotation
Search, a BBox Search, an OClass Search, and a Similarity Search. Figure 9 sketches the
main phases of the search process involving the first three search operations, as described
hereafter. Every time a user interacts with the VISIONE interface (add/remove/update a bounding box, add/remove a keyword, click on an image, etc.), a new query is executed, consisting of the sequence of the instances of search operations currently active in the interface. The query is then split into subqueries, where a subquery contains instances
of a single search operation. In a nutshell, the system runs all the subqueries using the
appropriate search operation and then combines the search results using a sequence of
reorderings. In particular, we designed the system so that the OClass Search operation has the priority: its result set contains all the images which match the given query taking into account the classes drawn in the canvas (both objects and colors), but not their spatial locations. If the query also includes some scene tags (text box of the user interface), then the
Annotation Search is performed but only on the result set generated by the first OClass
Search. So in this case the Annotation Search actually produces only a rescoring of the
results obtained at the previous step. Finally, another rescore is performed using the BBox
Search. If the user does not issue any annotation keyword in the interface, only the OClass
Search and BBox Search are used. If, on the other hand, only one or more keywords are entered in the interface, only the Annotation Search is used to find the results. Lastly, if certain
filters (black and white, color, aspect ratio, etc.) have been activated by the user, the results
are filtered before being displayed in the browsing interface.
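The sequential combination described above can be sketched in pseudo-Python as follows (hypothetical method names; in VISIONE the rescoring steps are implemented with Lucene's QueryRescorer):

```python
def execute_query(canvas, keywords, filters, index):
    """Sequential combination of OClass, Annotation, and BBox Search (sketch)."""
    if canvas:  # objects/colors drawn on the canvas: OClass Search has priority
        results = index.oclass_search(canvas)
        if keywords:  # Annotation Search only rescores the OClass result set
            results = index.annotation_rescore(results, keywords)
        results = index.bbox_rescore(results, canvas)  # spatial rescoring
    elif keywords:  # keywords only: Annotation Search alone
        results = index.annotation_search(keywords)
    else:
        return []
    if filters:  # optional filters (black and white, color, aspect ratio, ...)
        results = [r for r in results if filters.match(r)]
    return results
```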
The Similarity Search is the only search operation that is stand-alone in our system,
i.e., it is never combined with other search operations, and it is performed every time the
user double-clicks on an image displayed in the browsing interface. However, we observe
that in future versions of VISIONE it may be interesting to also include the possibility of
using Similarity Search to reorder the results obtained from other search operations.
Scene Tags Textual Representation: “family puppy”
Textual Encoding of Object and Color Bounding Boxes: “c2humanface c3humanface e2humanface e3humanface d5dog d6dog 4ecolor15”
Result filtering: only color images with max 2 humanface
Figure 9. Outline of the search process: the query formulated by the user in the search interface is translated into three textual subqueries. Each subquery is then used by a specific search operation to query our index. The search operations are not performed simultaneously, but in sequence, following the order indicated by the numbers in the figure. Search operators in intermediate steps are not performed on the whole dataset but rather on the result set of the search operation performed in the previous step.
5. Evaluation
As already discussed in Sections 3 and 4, a user query is executed as a combination of
search operations (Annotation Search, BBox Search, OClass Search, and Similarity Search).
The final result set returned to the user highly depends on the results returned by each
executed search operation. Each search operation is implemented in Apache Lucene using a
specific ranker that determines how the textual encoding of the database items are compared
with the textual encoding of the query in order to provide the ranked list of results.
In our first implementation of the system, used at the VBS competition in 2019, we tested various rankers for each search operation, and we estimated the performance of the system based on our personal experience. Specifically, we tested a set of queries with different rankers and selected the ranker that provided good results in the top
positions of the returned items. However, given the lack of a ground truth, this qualitative
analysis was based on a subjective feedback provided by a member of our team who
explicitly looked at the top-returned images obtained with the various tested scenarios,
and judged how good the results were.
After the competition, we decided to have a more accurate approach to estimate the
performance of the system, and the results of this analysis are discussed in this section. As
the choice of the rankers strongly influences the performance of the system, we decided to carry out a more in-depth and objective analysis focused on this part of the system. The final goal of this analysis is to find the best combination of rankers for our system. Intuitively,
the best combination of rankers is the one that, on average, more often puts good results (that is, target results for the search challenge) at the top of the result list. Specifically, we used the query logs acquired during the participation in the challenge. The logs store all the
sequences of search operations that were executed as a consequence of users interacting with the system. By using these query logs, we were able to re-execute the same user sessions using different rankers. In this way, we objectively measured the performance of the system when the same sequence of operations was executed with different rankers.
We focus mainly on the rankers for the BBox Search, OClass Search, and Annotation
Search. We do not consider the Similarity Search as it is an independent search operation in
our system, and previous work [ ] already proved that the dot product (TF ranker) works well with the surrogate text encodings of the R-MAC descriptors, which are the features adopted in our system for the Similarity Search.
5.1. Experiment Design and Evaluation Methodology
As anticipated before, our analysis makes use of the log of queries executed during
the 2019 VBS competition. The competition was divided into three content search task types: visual KIS, textual KIS, and AVS, already described in Section 1. For each task, a series of runs is executed. In each run, the users are requested to find one or more target videos. When a user believes that he/she has found the target video, he/she submits the result to the organization team, which evaluates the submission.
After the competition, the organizers of VBS provided us with the VBS2019 server
dataset that contains all the tasks issued at the competition (target video/textual description,
start/end time of target video for KIS tasks, and ground-truth segments for KIS tasks),
the client logs of all the systems participating in the competition, and the submissions made by the various teams. We used the ground-truth segments and the log of the queries submitted to our system to evaluate the performance of our system under different settings.
We restricted the analysis only to the logs related to textual and visual KIS tasks since
ground-truths for AVS tasks were not available. Please note that for the AVS tasks the
evaluation of the correctness of the results submitted by each team during the competition
was made on site by members of a jury who evaluated the submitted images one by one.
For these tasks, in fact, a predefined ground-truth is not available.
During the VBS competition a total of four users (two experts and two novices)
interacted with our system to solve 23 tasks (15 visual KIS and 8 textual KIS). The total
number of queries executed on our system for those tasks was 1600 (we recall that, in our system, a new query is executed at each interaction of a user with the search interface).
In our analysis, we considered four different rankers to sort the results obtained
by each search operation of our system. Specifically, we tested the rankers based on the following text scoring functions:
BM25: Lucene’s implementation of the well-known similarity function BM25 intro-
duced in [64];
TFIDF: Lucene’s implementation of the weighing scheme known as Term Frequency-
Inverse Document Frequency introduced in [65];
TF: implementation of dot product similarity over the frequency terms vector;
NormTF: implementation of cosine similarity (the normalized dot product of the two
weight vectors) over the frequency terms vectors.
Since we consider three search operations and four rankers, we have a total of 64 possible combinations. We denote each combination with a triplet X-Y-Z, where X is the ranker used for the BBox Search, Y is the ranker used for the Annotation Search, and Z is the ranker used for the OClass Search. In the implementation of
VISIONE used at the 2019 VBS competition, we employed the combination BM25-BM25-TF.
With the analysis reported in this section, we compare all the different combinations in
order to find the one that is most suited for the video search task.
For the analysis reported in this section we went through the logs and automatically
re-executed all the queries using the 64 different combinations of rankers in order to find
the one that, with the highest probability, finds a relevant result (i.e., a keyframe in the
ground-truth) in the top returned results. Each combination was obtained by selecting a
specific ranker (among BM25, NormTF, TF, and TFIDF) for each search operation (BBox
Search, Annotation Search, and OClass Search).
Evaluation Metrics
During the competition the user has to retrieve a video segment from the database
using the functionalities of the system. A video segment is composed of various keyframes,
which can be significantly different from one another; see Figure 10 for an example.
Figure 10. Example of the ground-truth keyframes for a 20 second video clip used as a KIS task at VBS2019. During the competition, our team correctly found the target video by formulating a query describing one of the keyframes depicting a lemon. However, note that most of the keyframes in the ground-truth were not relevant for the specific query submitted to our system, as only three keyframes (outlined in green in the figure) represented the scene described in the query (yellow lemon in the foreground).
In our analysis, we assume that the user stops examining the ranked result list as soon
as he/she finds one relevant result, that is one of the keyframes belonging to the target
video. Therefore, given that relevant keyframes can be significantly different from one another, we do not take into account the rank position of all the keyframes composing the ground-truth of a query, as required by performance measures like Mean Average Precision or Discounted Cumulative Gain. Rather, we want to measure how good the system is at proposing at least one of the target keyframes in the top positions.
In this respect, we use the Mean Reciprocal Rank (MRR, Equation (1)) as a quality measure, since it allows us to evaluate how good the system is at returning at least one relevant result (one of the keyframes of the target video) in the top positions of the result set. Formally, given a set Q of queries, let GTq be the ground-truth of a query q ∈ Q, i.e., the set of nq keyframes of the target video-clip searched using the query q; we define:
rank(img, q) as the rank of the image img ∈ GTq in the ranked results returned by our system after executing the query q;
rq = min{rank(img, q) | img ∈ GTq} as the rank of the first correct result in the ranked result list for the query q.
The Mean Reciprocal Rank for the query set Q is given by

MRR = (1 / |Q|) Σ_{q ∈ Q} RR(q), (1)

where the Reciprocal Rank (RR) for a single query q is defined as

RR(q) = 0 if there are no relevant results, and RR(q) = 1/rq otherwise. (2)
We evaluated the MRR for each different combination of rankers. Moreover, as we expect that a user inspects just a small portion of the results returned in the browsing interface, we also evaluated the performance of each combination in finding at least one correct result in the top k positions of the result list (k can be interpreted as the maximum number of images inspected by a user). To this scope, we computed the MRR at position k (MRR@k), obtained by replacing RR in Equation (1) with

RR@k(q) = 0 if rq > k or there are no relevant results, and RR@k(q) = 1/rq otherwise. (4)

In the experiments we consider values of k smaller than 1000, with a focus on values between 1 and 100, as we expect cases where a user inspects more than 100 results to be less realistic.
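The evaluation metric can be transcribed directly from the equations above (a minimal sketch; each query is represented by the rank rq of its first correct result, or None when no relevant result is returned):

```python
def rr_at_k(rank, k):
    """Reciprocal rank truncated at position k (Equation (4))."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / rank

def mrr_at_k(ranks, k):
    """Mean Reciprocal Rank at k over a query set, given the rank rq of the
    first correct result for each query (None if no relevant result)."""
    return sum(rr_at_k(r, k) for r in ranks) / len(ranks)
```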
5.2. Results
In our analysis, we used 521 queries (out of the 1600 mentioned above) to calculate the MRR. In fact, the rest of the queries executed on our system during the VBS2019 competition are not eligible for our analysis, since they are not informative for choosing the best ranker configuration:
about 200 queries involved the execution of a Similarity Search, a video summary
or a filtering, whose results are independent of the rankers used in the three search
operations considered in our analysis;
the search result sets of about 800 queries do not contain any correct result due to the
lack of alignment between the text associated with the query and the text associated
with images relevant to the target video. For those cases, the system is not able to
display the relevant images in the result set regardless of the ranker used. In fact,
the effect of using a specific ranker only affects the ordering of the results and not the
actual selection of them.
Figure 11 reports the MRR values of all 64 combinations. We computed Fisher's randomization test with 100,000 random permutations as a non-parametric significance test, which, according to Smucker et al. [ ], is particularly appropriate to evaluate whether two approaches differ significantly. As the baseline we used the ranker combination employed during VBS2019 (i.e., BM25-BM25-TF), and in Figure 11 we marked with * all the approaches for which the MRR differs significantly from the baseline, with a two-sided p-value lower than 0.05. Note that the combination we used at VBS2019 (indicated with diagonal lines in the graph), which was chosen according to subjective judgment, performs well but is not the best. In fact, we noticed that some patterns in the combinations of the rankers used for the OClass Search and the Annotation Search are particularly effective, while others provide very poor results. For example, the combinations that use TF for the OClass Search and BM25 for the Annotation Search gave us the overall best results, while the combinations that use BM25 for the OClass Search and NormTF for the Annotation Search have the worst performance. Specifically, we have an MRR of 0.023 for the best combination (NormTF-BM25-TF) and 0.004 for the worst (BM25-NormTF-BM25), which corresponds to a relative improvement of the MRR of 475%. Moreover, the best combination has a relative improvement of 38% over the baseline used at VBS2019. These results give us evidence that an appropriate choice of rankers is crucial for system performance. Furthermore, from a deeper analysis of the MRR results it emerged quite clearly that, for the Annotation Search, the BM25 ranker is particularly effective, while the TF ranker highly degrades the performance. To further analyze the results, we focused on the performance of each individual search operation (Figure 12): we calculated the MRR values obtained for a fixed ranker while varying the rankers used with the other two search operations. These per-operation results confirm the evidence of Figure 11, where the best combination uses NormTF for the BBox Search, BM25 for the Annotation Search, and TF for the OClass Search: for the BBox Search, NormTF is on average the best choice; for the Annotation Search, BM25 is significantly the best; and for the OClass Search, TF is on average the best. It also turned out that, for the BBox Search, the rankers on average followed the same trend; for the Annotation Search, NormTF showed evident oscillations; and for the OClass Search, TF is less susceptible to fluctuations.
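The significance test used above can be sketched as a paired two-sided randomization test on the per-query reciprocal ranks (a simplified illustration, not the exact procedure or data used in our evaluation; the RR lists below are hypothetical):

```python
import random

def randomization_test(rr_a, rr_b, n_perm=100_000, seed=0):
    """Two-sided paired randomization test on the difference of mean RR
    between two systems; rr_a[i] and rr_b[i] are the RR values on query i."""
    rng = random.Random(seed)
    n = len(rr_a)
    observed = abs(sum(rr_a) - sum(rr_b)) / n
    extreme = 0
    for _ in range(n_perm):
        diff = 0.0
        for a, b in zip(rr_a, rr_b):
            # Under the null hypothesis the two system labels are exchangeable,
            # so swap each paired score with probability 1/2.
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        if abs(diff) / n >= observed:
            extreme += 1
    return extreme / n_perm  # estimated two-sided p-value

# A p-value below 0.05 corresponds to a * in Figure 11 and Table 1.
```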
Furthermore, to complete the analysis of the performance of the rankers, we analyzed the MRR@k, where k is the parameter that controls how many results are shown to the user in the result set. The results are reported in Figure 13, where we varied k between 1 and 1000, and in Table 1, where results for some representative values of k are reported. In order to facilitate the reading of the results, we focused the analysis only on eight combinations: the four with the best MRR@k, the four with the worst MRR@k, and the configuration used at VBS2019. The latter is also used as the baseline to evaluate the statistical significance of the results according to Fisher's randomization test. Approaches for which the MRR@k is significantly different from the MRR@k of the baseline are marked with * in Table 1. We observed that the configuration NormTF-BM25-TF performs the best for all the tested values of k; however, the improvement over the VBS2019 baseline is statistically significant only for k ≥ 10, that is, when the user inspects at least 10 results.
In conclusion, we identified the combination NormTF-BM25-TF as the best one, providing a relative improvement of 38% in MRR and 40% in MRR@100 with respect to the setting previously used at the VBS competition.
Figure 11. MRR (Mean Reciprocal Rank) of the 64 combinations of rankers; the bar filled with diagonal lines is the combination used at the VBS2019 competition. Each configuration is denoted with a triplet X-Y-Z, where X is the ranker used for the BBox Search, Y is the ranker used for the Annotation Search, and Z is the ranker used for the OClass Search. Statistically significant results with a two-sided p-value lower than 0.05 over the baseline BM25-BM25-TF are marked with * in the graph.
Figure 12. MRR varying the ranker for each search operation (three panels: BBox Search, Annotation Search, and OClass Search). The * stands in for the ranker indicated in each chart. A callout is used in each chart to mark the point corresponding to the combination used at VBS2019.
Figure 13. MRR@k for eight combinations of the rankers (the four best, the four worst, and the setting used at VBS2019), varying k from 1 to 1000.
Table 1. MRR@k for eight combinations of the rankers (the four best, the four worst, and the setting used at VBS2019), varying k. Statistically significant results with a two-sided p-value lower than 0.05 over the baseline BM25-BM25-TF are marked with *.
k=1    k=5    k=10   k=50   k=100  k=500  k=1000
NormTF-BM25-TF 0.015 0.017 0.019 * 0.022 * 0.022 * 0.023 * 0.023 *
TFIDF-BM25-TF 0.013 0.016 0.018 * 0.021 * 0.022 * 0.022 * 0.022 *
TF-BM25-TF 0.013 0.016 0.017 0.018 * 0.019 * 0.019 * 0.019 *
TF-BM25-BM25 0.013 0.015 0.016 0.017 0.017 * 0.018 * 0.018 *
TF-BM25-NormTF 0.013 0.015 0.016 0.017 * 0.017 * 0.018 * 0.018 *
BM25-BM25-TF (VBS 2019) 0.013 0.014 0.015 0.016 0.016 0.016 0.017
NormTF-TF-NormTF 0.000 * 0.001 * 0.003 * 0.004 * 0.004 * 0.005 * 0.005 *
NormTF-NormTF-BM25 0.000 * 0.001 * 0.002 * 0.004 * 0.004 * 0.005 * 0.005 *
BM25-NormTF-BM25 0.002 * 0.002 * 0.002 * 0.003 * 0.003 * 0.004 * 0.004 *
TFIDF-NormTF-BM25 0.000 * 0.001 * 0.001 * 0.003 * 0.004 * 0.004 * 0.004 *
5.3. Efficiency and Scalability Issues
As we stated in the introduction, the fact that the retrieval system proposed in this article is built on top of a text search engine guarantees, in principle, efficiency and scalability of queries. This has been practically verified: we obtained average response times of less than a second for all types of queries (even the more complex ones). Regarding the scalability of the system, we have not conducted dedicated experiments, so we can only make some optimistic assumptions. These assumptions are based on the observation that, if the "synthetic" documents generated for visual search by similarity and for the localization of objects and colors behave as textual documents, then the scalability of our system is comparable to that of commercial Web search engines. With regard to the scalability of visual similarity search, we rely on a technique for indexing R-MAC descriptors based on scalar quantization; the reader is referred to [ ], in which the scalability of this approach is proven. As far as objects and colors are concerned, we analyzed the sparsity of the inverted index corresponding to the synthetic documents and found it to be around 99.78%. Moreover, since the queries are similar in length to those of natural-language search scenarios (i.e., they have few terms), the scalability of the system is guaranteed at least as much as that of full-text search engines.
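The sparsity reported above can be measured by viewing the inverted index as a term-document matrix and counting the fraction of empty cells. A minimal sketch (the toy documents and their term names are purely hypothetical; in VISIONE each synthetic document encodes the objects and colors of one keyframe):

```python
def index_sparsity(docs):
    """Sparsity of an inverted index seen as a term-document matrix:
    the fraction of (term, document) pairs with no occurrence."""
    vocabulary = set()
    for doc in docs:
        vocabulary.update(doc.split())
    total_cells = len(vocabulary) * len(docs)
    nonzero_cells = sum(len(set(doc.split())) for doc in docs)
    return 1.0 - nonzero_cells / total_cells

# Toy "synthetic" documents listing hypothetical object/color codewords per keyframe.
docs = ["dog4a dog4b red1a", "car2c red1a", "dog4a blue3b"]
print(index_sparsity(docs))  # 1 - 7/15 ≈ 0.533
```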
6. Conclusions
In this paper, we described a frame-based interactive video retrieval system, named VISIONE, that participated in the Video Browser Showdown contest in 2019. VISIONE
includes several retrieval modules and supports complex multi-modal queries, including
query by keywords (tags), query by object/color location, and query by visual example.
A demo of VISIONE running on the VBS V3C1 dataset is publicly available at (http:
//visione.isti.cnr.it/, accessed on 22 April 2021).
VISIONE exploits a combination of artificial intelligence techniques to automatically
analyze the visual content of the video keyframes and extract annotations (tags), informa-
tion on objects and colors appearing in the keyframes (including the spatial relationship
among them), and deep visual descriptors. A distinct aspect of our system is that all
these extracted features are converted into specifically designed text encodings that are
then indexed using a full-text search engine. The main advantage of this approach is that
VISIONE can exploit the latest search engine technologies, which today guarantee high
efficiency and scalability.
The evaluation reported in this work shows that the effectiveness of the retrieval
is highly influenced by the text scoring function (ranker) used to compare the textual
encodings of the video features. In fact, by performing an extensive evaluation of the
system under several combinations, we observed that an optimal choice of the ranker used
to sort the search results can improve the performance in terms of Mean Reciprocal Rank
up to an order of magnitude. Specifically, for our system we found that TF, NormTF, and BM25 are particularly effective for comparing the textual representations of object/color classes, object/color bounding boxes, and tags, respectively.
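For reference, the single-term contributions of the scoring functions named above can be sketched as follows (simplified textbook formulas under common definitions; the actual rankers are those implemented by the underlying full-text search engine, and k1 = 1.2, b = 0.75 are standard defaults, not necessarily the values used in VISIONE):

```python
import math

def tf_score(term_freq):
    """TF: plain term-frequency score."""
    return term_freq

def norm_tf_score(term_freq, doc_len):
    """NormTF: term frequency normalized by document length."""
    return term_freq / doc_len

def idf(n_docs, doc_freq):
    """IDF (a common variant); TFIDF is tf_score * idf."""
    return math.log(n_docs / doc_freq)

def bm25_score(term_freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single query term: frequency saturation
    plus document-length normalization, weighted by IDF."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf(n_docs, doc_freq) * term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
```

Unlike TF, BM25 saturates with repeated occurrences of a term and down-weights terms that appear in many documents, which is consistent with its effectiveness on the tag-based Annotation Search observed above.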
Author Contributions:
Conceptualization, C.G., G.A., and L.V.; methodology, C.G., G.A., and L.V.;
software, G.A., P.B., F.C., and F.D.; validation, C.G., and L.V.; formal analysis, F.D., and L.V.; investi-
gation, C.G., P.B., and L.V.; data curation, C.V., P.B.; writing—original draft preparation, P.B., F.D.,
F.F., C.G., L.V. and C.V.; writing—review and editing, G.A., F.D., C.G., L.V. and C.V.; visualization,
F.D., L.V.; supervision, G.A.; funding acquisition, G.A., F.F. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was partially supported by H2020 project AI4EU under GA 825619, by H2020
project AI4Media under GA 951911, by “Smart News: Social sensing for breaking news”, CUP CIPE
D58C15000270008, by VISECH ARCO-CNR, CUP B56J17001330004, and by “Automatic Data and
documents Analysis to enhance human-based processes” (ADA), CUP CIPE D55F17000290009.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement:
The V3C1 dataset, which consists of 7475 video files, amounting to 1000 h of video content (1,082,659 predefined segments), is publicly available. In order to download the dataset (which is provided by NIST), please follow the instructions reported at (https://videobrowsershowdown.org/call-for-papers/existing-data-and-tools/, accessed on 22 April 2021).
Acknowledgments: We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
Conflicts of Interest: The authors declare no conflict of interest.
Rossetto, L.; Gasser, R.; Lokoc, J.; Bailer, W.; Schoeffmann, K.; Muenzer, B.; Soucek, T.; Nguyen, P.A.; Bolettieri, P.; Leibetseder, A.;
et al. Interactive Video Retrieval in the Age of Deep Learning - Detailed Evaluation of VBS 2019. IEEE Trans. Multimed.
23, 243–256, doi:10.1109/TMM.2020.2980944.
Cobârzan, C.; Schoeffmann, K.; Bailer, W.; Hürst, W.; Blažek, A.; Lokoˇc, J.; Vrochidis, S.; Barthel, K.U.; Rossetto, L. Interactive
video search tools: A detailed analysis of the video browser showdown 2015. Multimed. Tools Appl.
,76, 5539–5571,
Lokoˇc, J.; Bailer, W.; Schoeffmann, K.; Muenzer, B.; Awad, G. On influential trends in interactive video retrieval: Video Browser
Showdown 2015–2017. IEEE Trans. Multimed. 2018,20, 3361–3376, doi:10.1109/TMM.2018.2830110.
Berns, F.; Rossetto, L.; Schoeffmann, K.; Beecks, C.; Awad, G. V3C1 Dataset: An Evaluation of Content Characteristics. In
Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; Association
for Computing Machinery: New York, NY, USA, 2019; pp. 334–338, doi:10.1145/3323873.3325051.
Amato, G.; Bolettieri, P.; Carrara, F.; Debole, F.; Falchi, F.; Gennaro, C.; Vadicamo, L.; Vairo, C. VISIONE at VBS2019. In Lecture
Notes in Computer Science, Proceedings of the MultiMedia Modeling, Thessaloniki, Greece, 8–11 January 2019; Springer International
Publishing: Cham, Switzerland, 2019; pp. 591–596, doi:10.1007/978-3-030-05716-9_51.
Hu, P.; Zhen, L.; Peng, D.; Liu, P. Scalable deep multimodal learning for cross-modal retrieval. In Proceedings of the 42nd
International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019;
pp. 635–644.
Liu, Y.; Albanie, S.; Nagrani, A.; Zisserman, A. Use what you have: Video retrieval using representations from collaborative
experts. arXiv 2019, arXiv:1907.13487.
Mithun, N.C.; Li, J.; Metze, F.; Roy-Chowdhury, A.K. Learning joint embedding with multimodal cues for cross-modal video-text
retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, Yokohama, Japan, 11–14 June
2018; pp. 19–27.
Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Yokoya, N. Learning joint representations of videos and sentences with web
image search. In European Conference on Computer Vision; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016;
pp. 651–667.
Zhen, L.; Hu, P.; Wang, X.; Peng, D. Deep supervised cross-modal retrieval. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10394–10403.
Sclaroff, S.; La Cascia, M.; Sethi, S.; Taycher, L. Unifying textual and visual cues for content-based image retrieval on the world
wide web. Comput. Vis. Image Underst. 1999,75, 86–98.
Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. Devise: A deep visual-semantic embedding
model. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013;
pp. 2121–2129.
Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv
2014, arXiv:1411.2539.
Karpathy, A.; Joulin, A.; Fei-Fei, L.F. Deep fragment embeddings for bidirectional image sentence mapping. In Proceedings of the
Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1889–1897.
Dong, J.; Li, X.; Snoek, C.G. Word2visualvec: Image and video to sentence matching by visual feature prediction. arXiv
Miech, A.; Zhukov, D.; Alayrac, J.B.; Tapaswi, M.; Laptev, I.; Sivic, J. Howto100m: Learning a text-video embedding by watching
hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea,
27 October–2 November 2019; pp. 2630–2640.
Pan, Y.; Mei, T.; Yao, T.; Li, H.; Rui, Y. Jointly modeling embedding and translation to bridge video and language. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4594–4602.
Xu, R.; Xiong, C.; Chen, W.; Corso, J.J. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a
Unified Framework. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015;
Volume 5, p. 6.
La Cascia, M.; Ardizzone, E. Jacob: Just a content-based query system for video databases. In Proceedings of the 1996 IEEE
International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996;
Volume 2, pp. 1216–1219.
Marques, O.; Furht, B. Content-Based Image and Video Retrieval; Springer Science & Business Media: Berlin/Heidelberg, Germany,
2002; Volume 21.
21. Patel, B.; Meshram, B. Content based video retrieval systems. arXiv 2012, arXiv:1205.1641.
Faloutsos, C.; Barber, R.; Flickner, M.; Hafner, J.; Niblack, W.; Petkovic, D.; Equitz, W. Efficient and effective querying by image
content. J. Intell. Inf. Syst. 1994,3, 231–262.
Schoeffmann, K. Video Browser Showdown 2012–2019: A Review. In Proceedings of the 2019 International Conference on
Content-Based Multimedia Indexing (CBMI), Dublin, Ireland, 4–6 September 2019; pp. 1–4, doi:10.1109/CBMI.2019.8877397.
Lokoˇc, J.; Kovalˇcík, G.; Münzer, B.; Schöffmann, K.; Bailer, W.; Gasser, R.; Vrochidis, S.; Nguyen, P.A.; Rujikietgumjorn, S.;
Barthel, K.U. Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018. ACM Trans.
Multimed. Comput. Commun. Appl. 2019,15, doi:10.1145/3295663.
Lokoˇc, J.; Kovalˇcík, G.; Sou ˇcek, T. Revisiting SIRET Video Retrieval Tool. In International Conference on Multimedia Modeling;
Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; pp. 419–424, doi:10.1007/978-3-319-73600-6_44.
Rossetto, L.; Amiri Parian, M.; Gasser, R.; Giangreco, I.; Heller, S.; Schuldt, H. Deep Learning-Based Concept Detection in vitrivr.
In International Conference on Multimedia Modeling; Lecture Notes in Computer Science; Springer International Publishing: Cham,
Switzerland, 2019; pp. 616–621, doi:10.1007/978-3-030-05716-9_55.
Kratochvíl, M.; Veselý, P.; Mejzlík, F.; Lokoˇc, J. SOM-Hunter: Video Browsing with Relevance-to-SOM Feedback Loop. In
International Conference on Multimedia Modeling; Lecture Notes in Computer Science; Springer International Publishing: Cham,
Switzerland, 2020; pp. 790–795, doi:10.1007/978-3-030-37734-2_71.
Lokoˇc, J.; Koval ˇcík, G.; Souˇcek, T. VIRET at Video Browser Showdown 2020. In International Conference on Multimedia Modeling;
Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 784–789, doi:10.1007/978-3-
Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8697–8710.
Li, X.; Xu, C.; Yang, G.; Chen, Z.; Dong, J. W2VV++ Fully Deep Learning for Ad-hoc Video Search. In Proceedings of the 27th
ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1786–1794.
Sauter, L.; Amiri Parian, M.; Gasser, R.; Heller, S.; Rossetto, L.; Schuldt, H. Combining Boolean and Multimedia Retrieval in
vitrivr for Large-Scale Video Search. In International Conference on Multimedia Modeling; Lecture Notes in Computer Science;
Springer International Publishing: Cham, Switzerland, 2020; pp. 760–765, doi:10.1007/978-3-030-37734-2_66.
32. Rossetto, L.; Gasser, R.; Schuldt, H. Query by Semantic Sketch. arXiv 2019, arXiv:1909.12526.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In
Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
Chang, E.Y.; Goh, K.; Sychay, G.; Wu, G. CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point
machines. IEEE Trans. Circuits Syst. Video Technol. 2003,13, 26–38, doi:10.1109/TCSVT.2002.808079.
Carneiro, G.; Chan, A.; Moreno, P.; Vasconcelos, N. Supervised Learning of Semantic Classes for Image Annotation and Retrieval.
IEEE Trans. Pattern Anal. Mach. Intell. 2007,29, 394–410, doi:10.1109/TPAMI.2007.61.
Barnard, K.; Forsyth, D. Learning the semantics of words and pictures. In Proceedings of the Eighth IEEE International Conference
on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume II, pp. 408–415, doi:10.1109/ICCV.2001.937654.
Li, X.; Uricchio, T.; Ballan, L.; Bertini, M.; Snoek, C.G.M.; Bimbo, A.D. Socializing the Semantic Gap: A Comparative Survey on
Image Tag Assignment, Refinement, and Retrieval. ACM Comput. Surv. 2016,49, doi:10.1145/2906152.
Pellegrin, L.; Escalante, H.J.; Montes, M.; González, F. Local and global approaches for unsupervised image annotation. Multimed.
Tools Appl. 2016,76, 16389–16414., doi:10.1007/s11042-016-3918-9.
Amato, G.; Falchi, F.; Gennaro, C.; Rabitti, F. Searching and annotating 100M Images with YFCC100M-HNfc6 and MI-File. In
Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, Florence, Italy, 19–21 June 2017; pp.
26:1–26:4, doi:10.1145/3095713.3095740.
Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; Darrell, T. DeCAF: A Deep Convolutional Activation Feature
for Generic Visual Recognition. arXiv 2013, arXiv:1310.1531.
Babenko, A.; Slesarev, A.; Chigorin, A.; Lempitsky, V. Neural codes for image retrieval. In European Conference on Computer Vision;
Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; pp. 584–599, doi:10.1007/978-3-319-10590-1_38.
Razavian, A.S.; Sullivan, J.; Carlsson, S.; Maki, A. Visual instance retrieval with deep convolutional networks. arXiv
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587, doi:10.1109/CVPR.2014.81.
Razavian, A.S.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE Computer Society, Columbus,
OH, USA, 23–28 June 2014; pp. 512–519, doi:10.1109/CVPRW.2014.131.
Tolias, G.; Sicre, R.; Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. arXiv
Gordo, A.; Almazan, J.; Revaud, J.; Larlus, D. End-to-End Learning of Deep Visual Representations for Image Retrieval. Int. J.
Comput. Vis. 2017,124, 237–254, doi:10.1007/s11263-017-1016-8.
Najva, N.; Bijoy, K.E. SIFT and tensor based object detection and classification in videos using deep neural networks. Procedia
Comput. Sci. 2016,93, 351–358, doi:10.1016/j.procs.2016.07.220.
Anjum, A.; Abdullah, T.; Tariq, M.; Baltaci, Y.; Antonopoulos, N. Video stream analysis in clouds: An object detection
and classification framework for high performance video analytics. IEEE Trans. Cloud Comput.
,7, 1152–1167,
Yaseen, M.U.; Anjum, A.; Rana, O.; Hill, R. Cloud-based scalable object detection and classification in video streams. Future
Gener. Comput. Syst. 2018,80, 286–298, doi:10.1016/j.future.2017.02.003.
Rashid, M.; Khan, M.A.; Sharif, M.; Raza, M.; Sarfraz, M.M.; Afza, F. Object detection and classification: A joint selection and
fusion strategy of deep convolutional neural network and SIFT point features. Multimed. Tools Appl.
,78, 15751–15777,
51. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271, doi:10.1109/CVPR.2017.690.
Redmon, J.; Farhadi, A. YOLOv3 on the Open Images Dataset. 2018. Available online: https://pjreddie.com/darknet/yolo/
(accessed on 28 February 2019).
Gennaro, C.; Amato, G.; Bolettieri, P.; Savino, P. An approach to content-based image retrieval based on the Lucene search
engine library. In International Conference on Theory and Practice of Digital Libraries; Lecture Notes in Computer Science; Springer:
Berlin/Heidelberg, Germany, 2010; pp. 55–66, doi:10.1007/978-3-642-15464-5_8.
Amato, G.; Bolettieri, P.; Carrara, F.; Falchi, F.; Gennaro, C. Large-Scale Image Retrieval with Elasticsearch. In Proceeding of the
41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July
2018; pp. 925–928, doi:10.1145/3209978.3210089.
Amato, G.; Carrara, F.; Falchi, F.; Gennaro, C.; Vadicamo, L. Large-scale instance-level image retrieval. Inf. Process. Manag.
102100, doi:10.1016/j.ipm.2019.102100.
Amato, G.; Carrara, F.; Falchi, F.; Gennaro, C. Efficient Indexing of Regional Maximum Activations of Convolutions using
Full-Text Search Engines. In Proceedings of the ACM International Conference on Multimedia Retrieval, ACM, Bucharest,
Romania, 6–9 June 2017; pp. 420–423, doi:10.1145/3078971.3079035.
Thomee, B.; Shamma, D.A.; Friedland, G.; Elizalde, B.; Ni, K.; Poland, D.; Borth, D.; Li, L.J. YFCC100M: The New Data in
Multimedia Research. Commun. ACM 2016,59, 64–73, doi:10.1145/2812802.
Miller, G. WordNet: An Electronic Lexical Database; Language, speech, and communication; MIT Press: Cambridge, MA, USA, 1998.
Amato, G.; Gennaro, C.; Savino, P. MI-File: Using inverted files for scalable approximate similarity search. Multimed. Tools Appl.
2012,71, 1333–1362, doi:10.1007/s11042-012-1271-1.
Truong, T.D.; Nguyen, V.T.; Tran, M.T.; Trieu, T.V.; Do, T.; Ngo, T.D.; Le, D.D. Video Search Based on Semantic Extraction and
Locally Regional Object Proposal. In International Conference on Multimedia Modeling; Lecture Notes in Computer Science; Springer:
Cham, Switzerland, 2018; pp. 451–456, doi:10.1007/978-3-319-73600-6_49.
Rubner, Y.; Guibas, L.; Tomasi, C. The Earth Mover ’s Distance, MultiDimensional Scaling, and Color-Based Image Retrieval. In
Proceedings of the ARPA Image Understanding Workshop, New Orleans, LA, USA, 11–14 May 1997; Volume 661, p. 668.
Shang, W.; Sohn, K.; Almeida, D.; Lee, H. Understanding and improving convolutional neural networks via concatenated
rectified linear units. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 2217–2225.
Robertson, S.E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M.; Gatford, M. Okapi at TREC-3. In Proceedings of the Third Text
REtrieval Conference, TREC 1994, Gaithersburg, MD, USA, 2–4 November 1994; National Institute of Standards and Technology
(NIST), Gaithersburg, MD, USA: 1994; Volume 500–225, pp. 109–126.
65. Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972,28, 11–21.
Smucker, M.D.; Allan, J.; Carterette, B. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In
Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, Lisboa, Portugal, 6–8
November 2007; Association for Computing Machinery: New York, NY, USA, 2007; pp. 623–632, doi:10.1145/1321440.1321528.
... A first release of VISIONE [1,6], which participated in the 2019 edition of the Video Browser Showdown (VBS) [11], is described in details in [2]. VBS is an international video search competition that is held annually since 2012 [13]. ...
... One of the main peculiarity of our system is that all the different descriptors extracted from the video keyframes (features, scene tags, colors/object classes and locations) as well as the queries formulated by the user through the search interface (e.g., keywords describing the target scene and/or diagrams depicting objects and colors locations) are encoded using specifically-designed textual representations (see [2] for the details). This choice allows us to exploit mature and scalable full-text search technologies for indexing and searching large-scale video database without the need to implement dedicated access methods. ...
... The extracted cross-modal features are normalized and in principle very similar to visual descriptors like RMAC [14]. Hence we indexed them using the same textual encoding that we already exploited to index the RMAC descriptors (see [2,3]). ...
This paper presents the second release of VISIONE, a tool for effective video search on large-scale collections. It allows users to search for videos using textual descriptions, keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. One of the main features of our system is that it employs specially designed textual encodings for indexing and searching video content using the mature and scalable Apache Lucene full-text search engine.
... A detailed description of all the functionalities included in VISIONE and how each of them are implemented is provided in [2]. Moreover, in [2] we presented an analysis of the system retrieval performance, by examining the logs acquired during the VBS 2019 challenge. ...
... A detailed description of all the functionalities included in VISIONE and how each of them are implemented is provided in [2]. Moreover, in [2] we presented an analysis of the system retrieval performance, by examining the logs acquired during the VBS 2019 challenge. ...
... Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Franca Debole, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo, Claudio Vairo arXiv:2008.02749. [2] In this paper, we describe VISIONE, a video search system that allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined together to express complex queries and satisfy user needs. ...
Technical Report
Full-text available
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission to investigate and advance the state of the art in the Artificial Intelligence field, specifically addressing applications to digital media and digital humanities, and taking also into account issues related to scalability. This report summarize the 2020 activities of the research group.
... In this paper, we aim at describing the latest version of VISIONE for participating to the Video Browser Showdown (VBS) [10,17]. The first version of the tool [1,2] and the second [3] participated in previous editions of the competition, VBS 2019 and VBS 2021, respectively. VBS is an international video search competition that is held annually since 2012 and comprises three tasks, consisting of visual and textual known-item search (KIS) and ad-hoc video search (AVS) [10,17]. ...
... Moreover, we significantly revised the color palette and the extraction of colors used for our color-based search. Previously, as indicated in [2,3], we used a color palette consisting of 32 colors, and we classified the color of each image pixel using a k-nearest neighbor search between the actual pixel color and the colors in our palette. Nevertheless, many studies in the field of anthropology, visual psychology, and linguistics pointed out that some colors appear more memorable than others and that some basic color terms are used consistently and with consensus in different languages [5,7,19]. ...
VISIONE is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend. In the latest version of our system, we modified the user interface, and we made some changes to the techniques used to analyze and search for videos.
... • the organization of the 1st International Workshop on Learning to Quantify (LQ 2021) 1 [21], which has taken place in November 2021 as an online event; ...
... A detailed description of all the functionalities included in VISIONE and how each of them are implemented is provided in [2]. Moreover, in [1] we presented an analysis of the system retrieval performance, by examining the logs acquired during the VBS 2019 challenge. ...
The Artificial Intelligence for Media and Humanities laboratory (AIMH) has the mission to investigate and advance the state of the art in the Artificial Intelligence field, specifically addressing applications to digital media and digital humanities, while also taking into account issues related to scalability. This report summarizes the 2021 activities of the research group.
... For color or semantic sketches, vitrivr supports a plethora of features [51,56], VERGE clusters to twelve predefined colors using the Color Layout MPEG-7 descriptor, and HTW uses a handcrafted low-level feature [18]. VIREO [42] and VISIONE [1] also support sketch search, with VISIONE extracting dominant colors with pretrained color hash tables [5,72] and objects using pretrained neural networks [47,48,78]. CollageHunter allows image collages, which enable localization of example image queries on a canvas. ...
... In VISIONE, all modalities are mapped to text, which allows the use of Apache Lucene as a search backend. Each modality is a sub-query, and the Lucene QueryRescorer combines their search results [1]. In contrast, vitrivr uses a specialized database allowing vector, text and Boolean retrieval [15]. ...
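The rescoring idea described above can be illustrated with a small score-fusion sketch: re-rank the top hits of a main sub-query by blending in the scores of another modality's sub-query. This is a language-agnostic illustration with hypothetical scores and weight; Lucene's actual QueryRescorer operates on Java query and TopDocs objects:

```python
def rescore(main_hits, other_scores, weight=0.5):
    """Re-rank the top results of a main sub-query by adding a weighted
    contribution from another modality's sub-query (QueryRescorer-style).

    main_hits:    dict mapping doc id -> score from the main sub-query
    other_scores: dict mapping doc id -> score from a second sub-query
    Returns doc ids sorted by the combined score, best first.
    """
    combined = {
        doc: score + weight * other_scores.get(doc, 0.0)
        for doc, score in main_hits.items()
    }
    return sorted(combined, key=combined.get, reverse=True)
```

Only documents retrieved by the main sub-query are rescored, mirroring the top-N rescoring pattern: the second modality can reorder candidates but not introduce new ones.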
The Video Browser Showdown addresses difficult video search challenges through an annual interactive evaluation campaign attracting research teams focusing on interactive video retrieval. The campaign aims to provide insights into the performance of participating interactive video retrieval systems, tested by selected search tasks on large video collections. For the first time in its ten-year history, the Video Browser Showdown 2021 was organized in a fully remote setting and hosted a record number of sixteen scoring systems. In this paper, we describe the competition setting, tasks and results and give an overview of state-of-the-art methods used by the competing systems. By looking at query result logs provided by ten systems, we analyze differences in retrieval model performances and browsing times before a correct submission. Through advances in data gathering methodology and tools, we provide a comprehensive analysis of ad-hoc video search tasks, discuss results, task design and methodological challenges. We highlight that almost all top performing systems utilize some sort of joint embedding for text-image retrieval and enable specification of temporal context in queries for known-item search. Whereas a combination of these techniques drives the currently top performing systems, we identify several future challenges for interactive video search engines and the Video Browser Showdown competition itself.
... Several interactive frame-based video retrieval engines are comparable to VIVA [25]. Tools such as SOMHunter [21], VERGE [3], Visione [1], vitrivr [9] and VIREO [28] combine different automatic content analysis methods to facilitate search with different feature modalities. In particular, these tools support searching for objects, concepts or actions based on pre-defined classes and models built with publicly available data sets, such as ImageNet [5] or OpenImages [22]. ...
Video retrieval methods, e.g., for visual concept classification, person recognition, and similarity search, are essential to perform fine-grained semantic search in large video archives. However, such retrieval methods often have to be adapted to the users’ changing search requirements: which concepts or persons are frequently searched for, what research topics are currently important or will be relevant in the future? In this paper, we present VIVA, a software tool for building content-based video retrieval methods based on deep learning models. VIVA allows non-expert users to conduct visual information retrieval for concepts and persons in video archives and to add new people or concepts to the underlying deep learning models as new requirements arise. For this purpose, VIVA provides a novel semi-automatic data acquisition workflow including a web crawler, image similarity search, as well as review and user feedback components to reduce the time-consuming manual effort for collecting training samples. We present experimental retrieval results using VIVA for four use cases in the context of a historical video collection of the German Broadcasting Archive based on about 34,000 h of television recordings from the former German Democratic Republic (GDR). We evaluate the performance of deep learning models built using VIVA for 91 GDR specific concepts and 98 personalities from the former GDR as well as the performance of the image and person similarity search approaches.
... Video search engines, such as [2,3,4] developed by the AIMH Lab [1], would benefit from sketch image analysis. Integrating the proposed approach with them is future work. ...
The adoption of an appropriate approximate similarity search method is an essential prerequisite for developing a fast and efficient CBIR system, especially when dealing with large amounts of data. In this study we implement a web image search engine on top of a Locality Sensitive Hashing (LSH) index to allow fast similarity search on deep features. Specifically, we exploit transfer learning for deep feature extraction from images. Firstly, we adopt InceptionV3 pretrained on ImageNet as feature extractor; secondly, we try out several CNNs built on top of InceptionV3 as a convolutional base fine-tuned on our dataset. In both of the previous cases we index the extracted features within our LSH index implementation so as to compare the retrieval performance with and without fine-tuning. In our approach we try out two different LSH implementations: the first one working with real-number feature vectors and the second one with the binary transposed version of those vectors. Interestingly, we obtain the best performance when using the binary LSH, reaching almost the same result, in terms of mean average precision, obtained by performing a sequential scan of the features, thus avoiding the bias introduced by the LSH index. Lastly, we carry out a performance analysis class by class in terms of recall against mAP, highlighting, as expected, a strong positive correlation between the two.
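Binary LSH of the kind mentioned above is commonly realized with random hyperplanes: each bit of a vector's signature records which side of a random hyperplane the vector falls on, so angularly close vectors tend to share most bits. A minimal sketch under that assumption (not the paper's exact implementation; dimensions and seed are illustrative):

```python
import random

def make_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: returns a function mapping a real-valued
    vector of length `dim` to an n_bits binary signature."""
    rng = random.Random(seed)
    # One random Gaussian hyperplane normal per signature bit.
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def signature(v):
        # Bit i is 1 iff v lies on the non-negative side of plane i.
        return tuple(
            int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0) for p in planes
        )

    return signature

h = make_hash(dim=4, n_bits=16, seed=42)
sig_a = h([1.0, 0.2, 0.0, 0.1])
sig_b = h([0.9, 0.25, 0.05, 0.1])  # near-duplicate of the first vector
```

Candidates sharing a signature (or a signature prefix) with the query land in the same bucket, which is what makes the index sub-linear compared with a sequential scan.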
Approximate search for high-dimensional vectors is commonly addressed using dedicated techniques often combined with hardware acceleration provided by GPUs, FPGAs, and other custom in-memory silicon. Despite their effectiveness, harmonizing those optimized solutions with other types of searches often poses technological difficulties. For example, to implement a combined text+image multimodal search, we are forced first to query the index of high-dimensional image descriptors and then filter the results based on the textual query, or vice versa. This paper proposes a text surrogate technique to translate real-valued vectors into text and index them with a standard textual search engine such as Elasticsearch or Apache Lucene. This technique allows us to perform approximate kNN searches of high-dimensional vectors alongside classical full-text searches natively on a single textual search engine, enabling multimedia queries without sacrificing scalability. Our proposal exploits a combination of vector quantization and scalar quantization. We compared our approach to the existing literature in this field of research, demonstrating a significant improvement in performance through preliminary experimentation. Keywords: Surrogate text representation; Inverted index; Approximate search; High-dimensional indexing; Very large databases
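One simple instance of translating real-valued vectors into indexable text is term-frequency encoding via scalar quantization: each component becomes a synthetic token repeated in proportion to its quantized value, so a text engine's term-frequency scoring approximates the inner product between query and document vectors. A hedged sketch (token naming and scale are illustrative, not the paper's exact scheme, which also combines vector quantization):

```python
def to_surrogate_text(vector, scale=10):
    """Encode a non-negative real vector as space-separated tokens.

    Component i with value v becomes the synthetic token 'f<i>'
    repeated round(v * scale) times; larger `scale` means finer
    quantization at the cost of longer surrogate documents.
    """
    tokens = []
    for i, v in enumerate(vector):
        tokens.extend(["f%d" % i] * round(v * scale))
    return " ".join(tokens)

doc = to_surrogate_text([0.0, 0.5, 0.2])
```

The resulting strings can be indexed as ordinary documents in Elasticsearch or Lucene, and a query vector encoded the same way retrieves approximate nearest neighbours through standard full-text scoring.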
In the ongoing multimedia age, search needs become more variable and challenging to support. In the area of content-based similarity search, asking a search engine for one or just a few nearest neighbours to a query may not be sufficient to accomplish a challenging search task. In this work, we investigate a task type where users search for one particular multimedia object in a large database. The complexity of the task is empirically demonstrated with a set of experiments, and the need for a larger number of nearest neighbours is discussed. A baseline approach for finding a larger number of approximate nearest neighbours is tested, showing a potential speed-up with respect to a naive sequential scan. Last but not least, an open efficiency challenge for metric access methods is discussed for the datasets used in the experiments.
Nowadays, popular web search portals enable users to find available images corresponding to a provided free-form text description. With such sources of example images, a suitable composition/collage of images can be constructed as an appropriate visual query input to a known-item search system. In this paper, we investigate a querying approach enabling users to search videos with a multi-query consisting of positioned example images, so-called collage query, depicting expected objects in a searched scene. The approach relies on images from external search engines, partitioning of preselected representative video frames, relevance scoring based on deep features extracted from images/frames, and is currently integrated into the open-source version of the SOMHunter system providing additional browsing capabilities.
This paper presents a prototype video retrieval engine focusing on a simple known-item search workflow, where users initialize the search with a query and then use an iterative approach to explore a larger candidate set. Specifically, users gradually observe a sequence of displays and provide feedback to the system. The displays are dynamically created by a self-organizing map that employs the scores based on the collected feedback, in order to provide a display matching the user's preferences. In addition, users can inspect various other types of specialized displays for exploitation purposes once promising candidates are found.
Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose a fully deep learning method for query representation learning. The proposed method requires no explicit concept modeling, matching and selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV), previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple yet important changes, W2VV++ brings in a substantial improvement. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
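Improved triplet ranking losses of the kind mentioned above are typically max-margin losses that penalize only the hardest (highest-scoring) negative rather than summing over all negatives. A minimal sketch under that assumption; the similarity values and margin are illustrative, not taken from the paper:

```python
def triplet_rank_loss(sim_pos, sims_neg, margin=0.2):
    """Max-margin triplet ranking loss with hardest-negative mining.

    sim_pos:  similarity between the query and its matching item
    sims_neg: similarities between the query and non-matching items
    The loss is zero once the positive beats the hardest negative
    by at least `margin`.
    """
    hardest = max(sims_neg)
    return max(0.0, margin - sim_pos + hardest)
```

Compared with summing the hinge over every negative, focusing on the hardest one concentrates the gradient on the most informative violation, which is the modification usually credited with the improvement.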
In this work we analyze content statistics of the V3C1 dataset, which is the first partition of the Vimeo Creative Commons Collection (V3C). The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics, and will serve as evaluation basis for the Video Browser Showdown 2019-2021 and TREC Video Retrieval (TRECVID) Ad-Hoc Video Search tasks 2019-2021. The dataset comes with a shot segmentation (around 1 million shots) for which we analyze content specifics and statistics. Our research shows that the content of V3C1 is very diverse, has no predominant characteristics and provides a low self-similarity. Thus it is very well suited for video retrieval evaluations as well as for participants of TRECVID AVS or the VBS.
Despite the fact that automatic content analysis has made remarkable progress over the last decade - mainly due to significant advances in machine learning - interactive video retrieval is still a very challenging problem, with an increasing relevance in practical applications. The Video Browser Showdown (VBS) is an annual evaluation competition that pushes the limits of interactive video retrieval with state-of-the-art tools, tasks, data, and evaluation metrics. In this paper, we analyse the results and outcome of the 8th iteration of the VBS in detail. We first give an overview of the novel and considerably larger V3C1 dataset and the tasks that were performed during VBS 2019. We then go on to describe the search systems of the six international teams in terms of features and performance. And finally, we perform an in-depth analysis of the per-team success ratio and relate this to the search strategies that were applied, the most popular features, and problems that were experienced. A large part of this analysis was conducted based on logs that were collected during the competition itself. This analysis gives further insights into the typical search behavior and differences between expert and novice users. Our evaluation shows that textual search and content browsing are the most important aspects in terms of logged user interactions. Furthermore, we observe a trend towards deep learning based features, especially in the form of labels generated by artificial neural networks. But nevertheless, for some tasks, very specific content-based search features are still being used. We expect these findings to contribute to future improvements of interactive video search systems.
This paper presents the most recent additions to the vitrivr multimedia retrieval stack made in preparation for participation in the 9th Video Browser Showdown (VBS) in 2020. In addition to refining existing functionality and adding support for classical Boolean queries and metadata filters, we also completely replaced our storage engine with a new database called Cottontail DB. Furthermore, we have added support for scoring based on the temporal ordering of multiple video segments with respect to a query formulated by the user. Finally, we have also added a new object detection module based on Faster-RCNN and use the generated features for object instance search.
During the last three years, the most successful systems at the Video Browser Showdown employed effective retrieval models where raw video data are automatically preprocessed in advance to extract semantic or low-level features of selected frames or shots. This enables users to express their search intents in the form of keywords, sketch, query example, or their combination. In this paper, we present new extensions to our interactive video retrieval system VIRET that won Video Browser Showdown in 2018 and achieved the second place at Video Browser Showdown 2019 and Lifelog Search Challenge 2019. The new features of the system focus both on updates of retrieval models and interface modifications to help users with query specification by means of informative visualizations.
The great success of visual features learned from deep neural networks has led to a significant effort to develop efficient and scalable technologies for image retrieval. Nevertheless, their usage in large-scale Web applications of content-based retrieval is still challenged by their high dimensionality. To overcome this issue, some image retrieval systems employ the product quantization method to learn a large-scale visual dictionary from a training set of global neural network features. These approaches are implemented in main memory, preventing their usage in big-data applications. This work mainly investigates approaches to transform neural network features into text forms suitable for being indexed by a standard full-text retrieval engine such as Elasticsearch. The basic idea of our approaches relies on a transformation of neural network features that promotes sparsity without the need for unsupervised pre-training. We validate our approach on a recent convolutional neural network feature, namely Regional Maximum Activations of Convolutions (R-MAC), a state-of-the-art descriptor for image retrieval whose effectiveness has been proved through several instance-level retrieval benchmarks. An extensive experimental evaluation conducted on standard benchmarks shows the effectiveness and efficiency of the proposed approach and how it compares to state-of-the-art main-memory indexes.
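Promoting sparsity before textual encoding can be as simple as keeping only the largest components of each feature vector and zeroing the rest, so that each feature contributes only a few terms to the inverted index. A hedged sketch of top-k truncation as one such transformation; this is an illustrative option, not necessarily the paper's exact scheme:

```python
def sparsify(vector, k):
    """Promote sparsity: keep the k largest positive components of
    `vector`, zero out everything else (ties resolved permissively)."""
    threshold = sorted(vector, reverse=True)[k - 1]
    return [v if v >= threshold and v > 0 else 0.0 for v in vector]
```

After this step, only the surviving components need to be emitted as tokens for the full-text engine, keeping posting lists short and queries fast.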
Cross-modal retrieval aims to enable flexible retrieval across different modalities. The core of cross-modal retrieval is how to measure the content similarity between different types of data. In this paper, we present a novel cross-modal retrieval method, called Deep Supervised Cross-modal Retrieval (DSCMR). It aims to find a common representation space, in which the samples from different modalities can be compared directly. Specifically, DSCMR minimises the discrimination loss in both the label space and the common representation space to supervise the model learning discriminative features. Furthermore, it simultaneously minimises the modality invariance loss and uses a weight sharing strategy to eliminate the cross-modal discrepancy of multimedia data in the common representation space to learn modality-invariant features. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective in cross-modal learning and significantly outperforms the state-of-the-art cross-modal retrieval methods.
Cross-modal retrieval takes one type of data as the query to retrieve relevant data of another type. Most existing cross-modal retrieval approaches learn a common subspace in a joint manner, where the data from all modalities have to be involved during the whole training process. For these approaches, the optimal parameters of the different modality-specific transformations depend on each other, and the whole model has to be retrained when handling samples from new modalities. In this paper, we present a novel cross-modal retrieval method, called Scalable Deep Multimodal Learning (SDML). It proposes to predefine a common subspace, in which the between-class variation is maximized while the within-class variation is minimized. Then, it trains $m$ modality-specific networks for $m$ modalities (one network for each modality) to transform the multimodal data into the predefined common subspace to achieve multimodal learning. Unlike many existing methods, our method can train the different modality-specific networks independently and thus scales with the number of modalities. To the best of our knowledge, the proposed SDML could be one of the first works to independently project data of an unfixed number of modalities into a predefined common subspace. Comprehensive experimental results on four widely-used benchmark datasets demonstrate that the proposed method is effective and efficient in multimodal learning and outperforms the state-of-the-art methods in cross-modal retrieval.