An Image Retrieval System for Video*
Paolo Bolettieri, Fabio Carrara, Franca Debole, Fabrizio Falchi, Claudio
Gennaro, Lucia Vadicamo, and Claudio Vairo
Institute of Information Science and Technologies,
Italian National Research Council (CNR),
Via G. Moruzzi 1, Pisa, Italy
Abstract. Since the 1970s, Content-Based Image Indexing and Retrieval (CBIR) has been an active research area. Nowadays, the rapid increase of video data has paved the way for the advancement of technologies in many different communities for the creation of Content-Based Video Indexing and Retrieval (CBVIR). However, greater attention needs to be devoted to the development of effective tools for video search and browsing. In this paper, we present Visione, a system for large-scale video retrieval. The system integrates several content-based analysis and retrieval modules, including a keyword search, a spatial object-based search, and a visual similarity search. From the tests carried out by users when they needed to find as many correct examples as possible, the similarity search proved to be the most promising option. Our implementation is based on state-of-the-art deep learning approaches for content analysis and leverages highly efficient indexing techniques to ensure scalability. Specifically, we encode all the visual and textual descriptors extracted from the videos into (surrogate) textual representations that are then efficiently indexed and searched, using similarity functions, with an off-the-shelf text search engine.

Keywords: content-based image indexing · neural networks · multimedia retrieval · similarity search · object detection
1 Introduction
Video data is the fastest growing data type on the Internet, and because of the proliferation of high-definition video cameras, the volume of video data is exploding. Visione [1] is a content-based video retrieval system that participated for the first time in 2019 in the Video Browser Showdown (VBS) [11], an international video search competition that evaluates the performance of interactive video retrieval systems. VBS 2019 uses the V3C1 dataset, which consists of 7,475 video files, amounting to 1,000 hours of video content (1,082,659 predefined segments) [15], and encompasses three content search tasks: visual Known-Item Search (visual KIS), textual Known-Item Search (textual KIS), and Ad-hoc Video Search (AVS). The visual KIS task models the situation in which someone wants to find a particular video clip that they have already seen, assuming that it is contained in a specific collection of data. In the textual KIS, the target video clip is no longer visually presented to the participants of the challenge but is rather described in detail by text. This task simulates situations in which a user wants to find a particular video clip, without having seen it before, but knowing its content exactly. For the AVS task, instead, a textual description is provided (e.g., “A person playing guitar outdoors”) and participants need to find as many correct examples as possible, i.e., video shots that fit the given description.

* Work partially supported by the AI4EU project (EC-H2020, Contract n. 825619).
© Springer Nature Switzerland AG 2019. G. Amato et al. (Eds.): SISAP 2019, LNCS 11807, pp. 332–339, 2019.
In this paper, we describe the current version of Visione, an image retrieval system used to search for videos, presented for the first time at VBS2019. After the first implementation of the system, described in [1], we decided to focus our attention on the query phase, improving the user interaction with the interface. For that reason, we introduced a set of icons for object location and, inspired by other systems involved in previous editions of VBS (e.g. [10]), we integrated query-by-color sketches. In the next sections, we describe the main components of the system and the techniques underlying it.
2 System Components
Visione is based on state-of-the-art deep learning approaches for visual content analysis and exploits highly efficient indexing techniques to ensure scalability. In Visione, we use the keyframes made available by the VBS organizers (about 1 million segments and keyframes), focusing our work on the extraction of relevant information from these keyframes and on the design of a clear and simple user interface.
In the following, we give a brief description of the main components of the
system: the User Interface and the Search Engine (see Figure 1).
2.1 User Interface
Fig. 1. The main components of Visione: the User Search Interface, and the Indexing and Retrieval System

The user interface, shown in the upper part of Figure 1, provides a text box to specify keywords, and a canvas for sketching the objects to be found in the target video. Inspired by one of the systems at VBS2018, we also integrated query-by-color sketches, realized with the same interface used for the objects (canvas and bounding boxes). The canvas is split into a grid of 7×7 cells, where the user can draw simple bounding boxes to specify the location of the desired objects/colors. The user can move, enlarge or reduce the drawn bounding boxes to refine the search. In the current version of the system, we implemented a simple drag & drop on the canvas using icons for the most common objects. Furthermore, with the same mechanism we define a color palette, available as icons, to facilitate the search by color: for each cell of the 7×7 grid, we calculate the dominant colors using a K-NN approach, widely adopted in color-based image segmentation [13].
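The per-cell dominant-color step can be sketched as follows. This is a minimal illustration, not the exact implementation used in Visione: the palette, the 1-NN assignment in RGB space, and the grid-splitting rule are all simplifying assumptions.

```python
from collections import Counter

# Hypothetical palette: a few named colors with RGB coordinates.
PALETTE = {
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (255, 0, 0),
    "green": (0, 128, 0), "blue": (0, 0, 255), "yellow": (255, 255, 0),
}

def nearest_color(rgb):
    """1-NN assignment: map a pixel to the closest palette color (squared L2)."""
    return min(PALETTE, key=lambda name: sum((a - b) ** 2
                                             for a, b in zip(rgb, PALETTE[name])))

def dominant_colors(pixels, top=2):
    """Return the `top` most frequent palette colors among a cell's pixels."""
    counts = Counter(nearest_color(p) for p in pixels)
    return [name for name, _ in counts.most_common(top)]

def cell_pixels(image, rows=7, cols=7):
    """Split an image (a 2-D list of RGB tuples) into a rows x cols grid,
    yielding ((row, col), pixels) for each cell."""
    h, w = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            cell = [image[y][x]
                    for y in range(r * h // rows, (r + 1) * h // rows)
                    for x in range(c * w // cols, (c + 1) * w // cols)]
            yield (r, c), cell
```

At query time, the icon the user drops on a cell only has to match one of that cell's precomputed dominant colors, which keeps the color search tolerant to mixed cells.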
Another new functionality added to the previous version of the system [1] is the possibility of using filters, such as the number of occurrences of specific objects and the type of keyframes to be retrieved (B/W or color, 4:3 or 16:9). At browsing time, the user can use image similarity to refine the search, or group the keyframes in the result set that belong to the same video. Finally, for each keyframe in the result set, the user interface offers the possibility to show all the keyframes of the corresponding video, and to play the video starting from the selected keyframe: this helps to check whether the selected keyframe matches the query. A standard search in Visione, for all the tasks, can be performed by drawing one or more bounding boxes of objects/colors, by searching for some keywords, and often by combining them.
2.2 Search Engine
Retrieval and browsing require that the source material is first of all effectively indexed. In our case, we employ state-of-the-art deep learning approaches to extract both low-level and semantic visual features. We encode all the features extracted from the keyframes (visual features, keywords, object locations, and metadata) into textual representations that are then indexed using inverted files. We use a text surrogate representation [6], which was specifically extended to support efficient spatial object queries on large-scale data. In this way, it is possible to build queries by placing the objects to be found in the scene and to efficiently search for matching images in an interactive way. This choice allows us to exploit the efficient and scalable search technologies and platforms used nowadays for text retrieval. In particular, Visione relies on Apache Lucene.
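To make the inverted-file idea concrete, the following toy index sketches in plain Python (not Lucene) how space-separated surrogate tokens can be indexed and ranked. The overlap-count scoring is a deliberate simplification of the similarity functions a real text engine provides.

```python
from collections import defaultdict, Counter

class InvertedIndex:
    """Toy inverted file over space-separated surrogate tokens -- a stand-in
    for what a full-text engine like Lucene does at scale."""

    def __init__(self):
        # token -> {doc_id: term frequency}
        self.postings = defaultdict(Counter)

    def add(self, doc_id, surrogate_text):
        """Index one document given its surrogate textual representation."""
        for token in surrogate_text.split():
            self.postings[token][doc_id] += 1

    def search(self, query_text, k=10):
        """Rank documents by token overlap with the query (a crude similarity):
        only postings lists of the query tokens are touched, never the whole
        collection, which is what makes the approach scale."""
        scores = Counter()
        for token in query_text.split():
            for doc_id, tf in self.postings[token].items():
                scores[doc_id] += tf
        return scores.most_common(k)
```

For example, indexing two keyframes by their keyword tokens and querying `"beach sea"` returns the beach keyframe first; the same mechanism works unchanged for object-location and visual-feature surrogate tokens.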
In the next section, we describe in detail the techniques employed to obtain
useful visual/semantic features.
3 Methodologies
Visione addresses the issues of CBVIR by modeling the data using both simple features (color, texture) and derived semantic features. Regarding the derived features, Visione relies heavily on deep learning techniques, trying to bridge the semantic gap between text and image using the following approaches:
– for keyword search: we exploit an image annotation system based on different Convolutional Neural Networks to extract scene attributes;
– for object location search: we exploit the efficiency of YOLO as a real-time object detection system to retrieve the video shots containing the objects sketched by the user;
– for visual similarity search: we perform a similarity search by computing the similarity between visual features represented using the R-MAC [17] visual descriptor.
Keywords. Convolutional Neural Networks, used to extract the deep features, are able to associate images with the categories they are trained on, but quite often these categories are insufficient to associate relevant keywords/tags to an image. For this reason, Visione exploits an automatic annotation system to annotate untagged images. This system, described in [2], is based on YFCC100M-HNfc6, a set of deep features extracted from the YFCC100M dataset [16] using the Caffe framework [8]. The image annotation system is based on an unsupervised approach that extracts the knowledge implicitly contained in the huge collection of unstructured texts describing the images of the YFCC100M dataset, allowing us to label the images without training a model. The image annotation system also exploits the metadata of the images, validated using WordNet [5].
Fig. 2. Search Engine: the index for both object location and keywords.
Object Location. Following the observation that the human eye is able to identify objects in an image very quickly, we decided to take advantage of modern object detection technologies to search for object instances in order to retrieve the exact video shot.
For this purpose, we use YOLOv3 [14] as object detector, both because it is extremely fast and because of its accuracy. Our image query interface is subdivided into a 7×7 grid, in the same way that YOLO segments images to detect objects. Each object detected by YOLO in an image I is indexed using a specific encoding ENC, conceived to put together the location and the class of the object (codloc codclass). The idea of using YOLO to detect objects within videos has already been exploited in VBS, e.g. by [18], but our approach is distinguished by encoding the class and the location of the objects in a single textual description of the image, allowing us to search using a standard text search engine. Basically, for each image I entry in the index, we have a space-separated concatenation of ENCs, one for each cell (codloc) of the grid that contains the object (codclass), where:
– loc is the juxtaposition of row and col on the grid;
– class is the name of the object as classified by YOLO.
In practice, through the UI the users can draw the objects they are looking for, specifying the desired location for each of them (e.g., tree and vehicle in Figure 1). For each object, the UI then encodes the request appropriately to interrogate the index, marking all the cells in the grid that contain the object. For example, for the query in Figure 1, we will search for entries I in our index that contain the sequence p1tree p2tree ... p6tree, where pi is the code of the i-th cell (with 1 ≤ i ≤ 6, since the tree icon covers six cells). Note that a cell of a sketch can contain multiple objects. As shown in Figure 2, for the image with id 2075, we extract both keywords (beach, cloud, etc.), using the image annotation tool, and object locations (3dperson, etc.), exploiting the object detector; we then index these two features in a single Lucene index.
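The encoding just described can be sketched as follows. This is hypothetical illustration code: the row-digit/column-letter code format is an assumption modeled on codes like "3dperson" above, and the rectangle-overlap test stands in for whatever cell-assignment rule the actual system uses.

```python
GRID = 7
COLS = "abcdefg"  # assumed encoding: columns lettered a-g, rows numbered 1-7,
                  # consistent with codes such as "3dperson" in the text

def enc(row, col, cls):
    """ENC = codloc + codclass: juxtaposition of row, column and class name."""
    return f"{row + 1}{COLS[col]}{cls}"

def encode_detection(box, cls):
    """Encode one detected object (box in relative coords x0, y0, x1, y1)
    as one token per grid cell the box overlaps."""
    x0, y0, x1, y1 = box
    tokens = []
    for r in range(GRID):
        for c in range(GRID):
            # cell bounds in relative coordinates
            cx0, cy0 = c / GRID, r / GRID
            cx1, cy1 = (c + 1) / GRID, (r + 1) / GRID
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:  # overlap test
                tokens.append(enc(r, c, cls))
    return tokens

def image_surrogate(detections):
    """Space-separated concatenation of ENCs for all detections in an image;
    this single string is what gets indexed by the text engine."""
    return " ".join(t for box, cls in detections
                      for t in encode_detection(box, cls))
```

The query side uses the same function: the bounding box the user draws on the canvas is expanded into the tokens of every cell it covers, and those tokens are matched against the indexed surrogates.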
Visual Similarity. Visione also supports visual content-based search functionalities, which allow users to retrieve scenes containing keyframes visually similar to a query image given by example. To start the search, the user can select any keyframe of a video as query. In order to represent and compare the visual content of the images, we use the Regional Maximum Activations of Convolutions (R-MAC) [17]. This descriptor effectively aggregates several local convolutional features (extracted at multiple positions and scales) into a dense and compact global image representation. We use the ResNet-101 trained model provided by [7] as feature extractor, since it achieved the best performance on standard benchmarks. To efficiently index the R-MAC descriptors, we transform the deep features into a textual encoding suitable for being indexed by a standard full-text search engine, such as Lucene: we first use the Deep Permutation technique [3] to encode each deep feature into a permutation vector, which is then transformed into a Surrogate Text Representation (STR) as described in [6]. The advantage of using a textual encoding is that we can efficiently exploit off-the-shelf text search engines for performing image searches at large scale.
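A minimal sketch of this pipeline (Deep Permutations followed by a surrogate-text expansion) might look as follows; the truncation length and the term-repetition weighting are illustrative choices, not the exact parameters used in Visione or in [3,6].

```python
def deep_permutation(features, truncate=None):
    """Deep Permutation encoding (after [3]): represent a feature vector by
    the ranking of its components, i.e. the component indices sorted by
    decreasing activation, optionally truncated to the top entries."""
    order = sorted(range(len(features)), key=lambda i: -features[i])
    return order[:truncate] if truncate is not None else order

def surrogate_text(features, truncate=10):
    """Surrogate Text Representation (STR) sketch: emit one synthetic term
    per top-ranked component ('f<index>'), repeated more times the higher
    its rank, so that the term frequencies seen by a text search engine
    approximate the similarity between the original vectors."""
    perm = deep_permutation(features, truncate)
    terms = []
    for rank, comp in enumerate(perm):
        terms.extend([f"f{comp}"] * (truncate - rank))
    return " ".join(terms)
```

Two visually similar keyframes yield R-MAC vectors whose top components largely coincide, so their surrogate texts share many high-frequency terms and score highly under standard TF-based text ranking.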
4 Results
For the evaluation of our system, we took advantage of our participation in the VBS competition, which was a great opportunity to test the system with both expert and novice users. For each task, a team receives a score based on the response time and on the number of correct and incorrect submissions.
KIS tasks. During the competition, the strategy used for solving both KIS tasks was mainly based on queries by object location and keywords. Queries by color sketch were used sparingly, since they proved less stable and sometimes degraded the quality of the results obtained with the keyword/object search. As shown in Figure 3, the textual KIS task was the hardest for our system, in accordance with the observation made by the organizers of the competition in [12], where they note that the textual KIS task is much harder to solve than the visual tasks.

AVS tasks. In this task, the keyword/object search and the image similarity search functionalities were mainly used. In particular, the image similarity search proved notably useful to retrieve keyframes of different videos with similar visual content.
Overall, we experienced how an image retrieval system can be useful for video search: for the textual KIS task the results were not particularly satisfying, but for the AVS task they are very promising. One problem in the textual KIS is the overly specific categorization of objects, which decreased recall: users do not always distinguish between car, truck, or vehicle, and may use any of them (as textual query) interchangeably. However, for the YOLO detector the difference is quite significant, and this leads to low recall. Globally speaking, one of the main problems was due to a rather simple user interface. In fact, Visione did not support functionalities like query history, multiple submissions at once, or any form of collaboration between team members: this led to redundant submissions and to a slow submission of multiple instances.
Fig. 3. The VBS2019 competition results for the three tasks AVS, KIS-textual and
KIS-visual (score between 0 and 100). The bold line highlights the result of our system.
5 Conclusion
We described Visione, a system presented at the Video Browser Showdown 2019
challenge. The system supports three types of queries: query by keywords, query
by object location, and query by visual similarity. Visione exploits state-of-the-art
deep learning approaches and ad-hoc surrogate text encodings of the extracted
features in order to use efficient technologies for text retrieval. From the experience at the competition, we ascertained the high efficiency of the indexing structure, built to support large-scale multimedia access, but also a lack of effectiveness in the keyword search. As a result of the system assessment made after the competition, we decided to invest more effort in the keyword-based search, trying to improve the image annotation part: we plan to integrate datasets of places, concepts and categories ([19], [4], [9]), and automatic tools for scene understanding. Furthermore, we will improve the user interface to make it more usable and collaborative.
References

1. Amato, G., Bolettieri, P., Carrara, F., Debole, F., Falchi, F., Gennaro, C., Vadicamo, L., Vairo, C.: VISIONE at VBS2019. In: MultiMedia Modeling - 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8-11, 2019, Proceedings, Part II. pp. 591–596 (2019)
2. Amato, G., Falchi, F., Gennaro, C., Rabitti, F.: Searching and Annotating 100M
Images with YFCC100M-HNfc6 and MI-File. In: Proceedings of the 15th Inter-
national Workshop on Content-Based Multimedia Indexing. pp. 26:1–26:4. CBMI
’17, ACM (2017)
3. Amato, G., Falchi, F., Gennaro, C., Vadicamo, L.: Deep permutations: Deep con-
volutional neural networks and permutation-based indexing. In: Similarity Search
and Applications. pp. 93–106. Springer International Publishing, Cham (2016)
4. Awad, G., Snoek, C.G.M., Smeaton, A.F., Quénot, G.: TRECVid semantic indexing of video: a 6-year retrospective. ITE Transactions on Media Technology and Applications 4(3), 187–208 (2016)
5. Fellbaum, C., Miller, G.: WordNet: an electronic lexical database. Language,
speech, and communication, MIT Press (1998)
6. Gennaro, C., Amato, G., Bolettieri, P., Savino, P.: An approach to content-based image retrieval based on the Lucene search engine library. In: Research and Advanced Technology for Digital Libraries. pp. 55–66. Springer Berlin Heidelberg
7. Gordo, A., Almazán, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124(2), 237–254 (2017)
8. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
9. Jiang, Y.G., Wu, Z., Wang, J., Xue, X., Chang, S.F.: Exploiting feature and class
relationships in video categorization with regularized deep neural networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence 40(2), 352–364 (2018)
10. Lokoč, J., Kovalčík, G., Souček, T.: Revisiting SIRET video retrieval tool. In: MultiMedia Modeling. pp. 419–424. Springer International Publishing, Cham (2018)
11. Lokoč, J., Bailer, W., Schöffmann, K., Münzer, B., Awad, G.: On influential trends in interactive video retrieval: Video Browser Showdown 2015–2017. IEEE Transactions on Multimedia 20(12), 3361–3376 (2018)
12. Lokoč, J., Kovalčík, G., Münzer, B., Schöffmann, K., Bailer, W., Gasser, R., Vrochidis, S., Nguyen, P.A., Rujikietgumjorn, S., Barthel, K.U.: Interactive search or sequential browsing? A detailed analysis of the Video Browser Showdown 2018. ACM Trans. Multimedia Comput. Commun. Appl. 15(1), 29:1–29:18 (2019)
13. Niraimathi, D.S.: Color based image segmentation using classification of k-nn with
contour analysis method. International Research Journal of Engineering and Tech-
nology 3, 1169–1177 (2016)
14. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767 (2018)
15. Rossetto, L., Schuldt, H., Awad, G., Butt, A.A.: V3C – a research video collection. In: MultiMedia Modeling. pp. 349–360. Springer International Publishing, Cham
16. Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth,
D., Li, L.J.: YFCC100M: The new data in multimedia research. Communications
of the ACM 59(2), 64–73 (2016)
17. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
18. Truong, T.D., Nguyen, V.T., Tran, M.T., Trieu, T.V., Do, T., Ngo, T.D., Le, D.D.:
Video search based on semantic extraction and locally regional object proposal. In:
MultiMedia Modeling. pp. 451–456. Springer International Publishing (2018)
19. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million
image database for scene recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence (2017)