... Nevertheless, trained feature detectors used by the retrieval models can still fail in some cases, and users often do not know or remember all the details of the searched (more complex) scene needed to provide a comprehensive query. Hence, the three properties are not always satisfied, and current known-item search video retrieval tools [1,2,8,30,31,34] still rely also on interactive search to boost their effectiveness. These tools integrate various information retrieval models and informative visualizations in well-arranged responsive interfaces, letting users inspect results and decide between various search strategies. ...
... The time of correct submissions of participating teams is marked on the line with respect to the start of each task. At VBS 2019, five other systems participated: vitrivr, VIREO, VISIONE, diveXplore, and VERGE. ...
... Considering the time to solve a task with the VIRET prototype tool, frequent browsing interactions and (re)constructions of a multimodal query often led to a longer average time for solved KIS tasks. [Figure 5: The time elapsed until a correct submission was received from a tool at VBS 2019 in visual and textual KIS tasks. The time limit t_L was set to 5 minutes for visual KIS tasks and 8 minutes for textual KIS tasks.] ...
Searching for one particular scene in a large video collection (known-item search) represents a challenging task for video retrieval systems. According to recent results from evaluation campaigns, even respected approaches based on machine learning do not make the task easy to solve in many cases. Hence, in addition to effective automatic multimedia annotation and embedding, interactive search is recommended as well. This paper presents a comprehensive description of VIRET, an interactive video retrieval framework that successfully participated in several recent evaluation campaigns. The utilized video analysis, feature extraction and retrieval models are detailed, along with several experiments evaluating the effectiveness of selected system components. The results of the prototype at the Video Browser Showdown 2019 are highlighted in connection with an analysis of collected query logs. We conclude that the framework comprises a set of effective and efficient models for most of the evaluated known-item search tasks in 1000 hours of video and could serve as a baseline reference approach. The analysis also reveals that the result presentation interface needs improvements for better performance of future VIRET prototypes.
... In this paper, we describe the latest version of VISIONE for participation in the Video Browser Showdown (VBS) [10,17]. The first version of the tool [1,2] and the second participated in previous editions of the competition, VBS 2019 and VBS 2021, respectively. VBS is an international video search competition that has been held annually since 2012 and comprises three tasks: visual and textual known-item search (KIS) and ad-hoc video search (AVS) [10,17]. ...
... One of the main characteristics of our system is that all the features extracted from the video keyframes, as well as from the user query, are transformed into textual encodings, so that an off-the-shelf full-text search engine can be employed to support large-scale indexing and searching (see  for further details). While the object/color and similarity search functionalities have been present in VISIONE since its first version, the text search and the temporal search were introduced last year. The semantic similarity search (Sect. ...
VISIONE is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend. In the latest version of our system, we modified the user interface, and we made some changes to the techniques used to analyze and search for videos.
... In  we initially introduced VISIONE by only listing its functionalities and briefly outlining the techniques it employs. In this work, instead, we have two main goals: first, to provide a more detailed description of all the functionalities included in VISIONE and how each of them is implemented and combined with the others; second, to present an analysis of the system's retrieval performance by examining the logs acquired during the VBS2019 challenge. ...
This paper describes in detail VISIONE, a video search system that allows users to search for videos using textual keywords, the occurrence of objects and their spatial relationships, the occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined to express complex queries and meet users' needs. The peculiarity of our approach is that we encode all information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding that is indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) need to be merged. In addition, we report an extensive analysis of the retrieval performance of the system, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies from those we tested.
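The spatial part of such a textual encoding can be sketched as follows (a minimal illustration; the token format, grid size, and helper names are hypothetical and do not reproduce VISIONE's actual scheme): each detected object is turned into terms that fuse its class label with the grid cells its bounding box covers, so a standard full-text engine can match spatial queries via term overlap.

```python
# Sketch of a surrogate text encoding for object locations (illustrative only;
# VISIONE's real encoding and tokens differ). Each detection becomes tokens
# combining the class label with covered cells of a coarse spatial grid.

def box_to_cells(box, grid=7):
    """Map a normalized bounding box (x1, y1, x2, y2) to covered grid cells."""
    x1, y1, x2, y2 = box
    cols = range(int(x1 * grid), min(int(x2 * grid) + 1, grid))
    rows = range(int(y1 * grid), min(int(y2 * grid) + 1, grid))
    return [(r, c) for r in rows for c in cols]

def encode_objects(detections, grid=7):
    """Turn detections [(label, box), ...] into a space-separated 'document'."""
    tokens = []
    for label, box in detections:
        for r, c in box_to_cells(box, grid):
            tokens.append(f"{label}_r{r}c{c}")
    return " ".join(tokens)

doc = encode_objects([("person", (0.0, 0.2, 0.3, 0.9)),
                      ("tie", (0.1, 0.5, 0.2, 0.7))])
```

A query sketching a person on the left is encoded the same way, and the text engine's term matching then scores keyframes whose objects overlap the sketched cells.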
... In the last decade, the VBS community has developed and incrementally improved many interactive search systems [1,11,13,15,16], some of which were made available as open-source to benefit the community. As an example, vitrivr [14,15], one of the most successful systems at VBS, is available as open source as well. ...
In the last decade, the Video Browser Showdown (VBS) became a comparative platform for various interactive video search tools competing in selected video retrieval tasks. However, the participation of new teams with an own, novel tool is prohibitively time-demanding because of the large number and complexity of components required for constructing a video search system from scratch. To partially alleviate this difficulty, we provide an open-source version of the lightweight known-item search system SOMHunter that competed successfully at VBS 2020. The system combines several features for text-based search initialization and browsing of large result sets; in particular a variant of W2VV++ model for text search, temporal queries for targeting sequences of frames, several types of displays including the eponymous self-organizing map view, and a feedback-based approach for maintaining the relevance scores inspired by PICHunter. The minimalistic, easily extensible implementation of SOMHunter should serve as a solid basis for constructing new search systems, thus facilitating easier exploration of new video retrieval ideas.
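The feedback-based maintenance of relevance scores can be sketched as a Bayesian update over candidate items (a minimal illustration; SOMHunter's actual user model, W2VV++ scoring, and display logic are not reproduced, and the softmax likelihood with its temperature is an assumption):

```python
import math

# Sketch of a PICHunter-style relevance update: each item keeps a score
# p[i] ~ P(item i is the searched target). When the user marks a displayed
# frame as closest to the target, items similar to that frame gain score.
# (Illustrative; not SOMHunter's actual model.)

def update_scores(p, sims_to_choice, temperature=0.5):
    """Multiply prior scores by a softmax-style likelihood of the user's choice."""
    likelihood = [math.exp(s / temperature) for s in sims_to_choice]
    posterior = [pi * li for pi, li in zip(p, likelihood)]
    z = sum(posterior)
    return [x / z for x in posterior]

p = [0.25, 0.25, 0.25, 0.25]   # uniform prior over 4 candidate items
sims = [0.9, 0.1, 0.2, 0.1]    # similarity of each item to the clicked frame
p = update_scores(p, sims)     # item 0, most similar to the click, gains score
```

Iterating this update over several feedback rounds concentrates the score mass on items resembling the user's choices, which is what drives the tool's result reordering.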
... Within this framework, we developed a content-based video retrieval system, VISIONE [30,31], to compete at the Video Browser Showdown (VBS), an international video search competition that evaluates the performance of interactive video retrieval systems. VISIONE is based on state-of-the-art deep learning approaches for visual content analysis and exploits highly efficient indexing techniques to ensure scalability. ...
The Artificial Intelligence for Multimedia Information Retrieval (AIMIR) research group is part of the NeMIS laboratory of the Information Science and Technologies Institute "A. Faedo" (ISTI) of the Italian National Research Council (CNR). The AIMIR group has a long experience in topics related to: Artificial Intelligence, Multimedia Information Retrieval, Computer Vision and Similarity search on a large scale.
We aim at investigating the use of Artificial Intelligence and Deep Learning for Multimedia Information Retrieval, addressing both effectiveness and efficiency. Multimedia information retrieval techniques should be able to provide users with pertinent results, fast, on huge amounts of multimedia data.
Application areas of our research results range from cultural heritage to smart tourism, from security to smart cities, from mobile visual search to augmented reality.
This report summarizes the 2019 activities of the research group.
... Furthermore, the confrontation positively drives the evolution of all the participating systems. For example, in the last installment, organized at the international MultiMedia Modeling conference MMM 2019, two systems (vitrivr and VIRET) were able to solve 18 known-item search tasks out of 23, while the other systems (VIREO, VISIONE, diveXplore, and VERGE) also solved a high number (9-11) of the evaluated tasks in 1000 hours of video! This remarkable performance of the systems is partially supported also by various new deep learning based representations integrated into the retrieval engines. ...
Known-item search in large video collections still represents a challenging task for current video retrieval systems that have to rely both on state-of-the-art ranking models and interactive means of retrieval. We present a general overview of the current version of the VIRET tool, an interactive video retrieval system that successfully participated at several international evaluation campaigns. The system is based on multi-modal search and convenient inspection of results. Based on collected query logs of four users controlling instances of the tool at the Video Browser Showdown 2019, we highlight query modification statistics and a list of successful query formulation strategies. We conclude that the VIRET tool represents a competitive reference interactive system for effective known-item search in one thousand hours of video.
We present IVOS, an interactive video content search system that allows for object-based search and filtering in video archives. The main idea behind it is to use the results of recent object detection models to index all keyframes with a manageable set of object classes, and to allow the user to filter by different characteristics, such as object name, object location, relative object size, object color, and combinations for different object classes – e.g., “large person in white on the left, with a red tie”. In addition, IVOS can also find segments with a specific number of objects of a particular class (e.g., “many apples” or “two people”) and supports similarity search based on similar object occurrences.
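A query of that kind can be evaluated as a simple predicate over per-keyframe detections (an illustrative sketch; the function names, thresholds, and the coarse left/right location model are hypothetical, not IVOS's implementation):

```python
# Sketch of object-based filtering over detections (illustrative only):
# keep keyframes whose detections satisfy a predicate on class, count,
# relative size, and coarse horizontal location.

def matches(detections, label, min_count=1, min_area=0.0, side=None):
    """detections: [(label, (x1, y1, x2, y2)), ...] with normalized boxes."""
    hits = 0
    for lab, (x1, y1, x2, y2) in detections:
        if lab != label:
            continue
        area = (x2 - x1) * (y2 - y1)       # relative object size
        cx = (x1 + x2) / 2                 # horizontal center for location
        if area < min_area:
            continue
        if side == "left" and cx >= 0.5:
            continue
        if side == "right" and cx < 0.5:
            continue
        hits += 1
    return hits >= min_count

frame = [("person", (0.05, 0.1, 0.45, 0.95)),   # large person on the left
         ("tie", (0.2, 0.4, 0.25, 0.6))]
# "large person on the left, with a tie"
ok = matches(frame, "person", min_area=0.2, side="left") and matches(frame, "tie")
```

Counting queries like "two people" fall out of the same predicate by raising `min_count`.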
This paper presents the second release of VISIONE, a tool for effective video search on large-scale collections. It allows users to search for videos using textual descriptions, keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. One of the main features of our system is that it employs specially designed textual encodings for indexing and searching video content using the mature and scalable Apache Lucene full-text search engine.
Despite the fact that automatic content analysis has made remarkable progress over the last decade - mainly due to significant advances in machine learning - interactive video retrieval is still a very challenging problem, with an increasing relevance in practical applications. The Video Browser Showdown (VBS) is an annual evaluation competition that pushes the limits of interactive video retrieval with state-of-the-art tools, tasks, data, and evaluation metrics. In this paper, we analyse the results and outcome of the 8th iteration of the VBS in detail. We first give an overview of the novel and considerably larger V3C1 dataset and the tasks that were performed during VBS 2019. We then go on to describe the search systems of the six international teams in terms of features and performance. And finally, we perform an in-depth analysis of the per-team success ratio and relate this to the search strategies that were applied, the most popular features, and problems that were experienced. A large part of this analysis was conducted based on logs that were collected during the competition itself. This analysis gives further insights into the typical search behavior and differences between expert and novice users. Our evaluation shows that textual search and content browsing are the most important aspects in terms of logged user interactions. Furthermore, we observe a trend towards deep learning based features, especially in the form of labels generated by artificial neural networks. But nevertheless, for some tasks, very specific content-based search features are still being used. We expect these findings to contribute to future improvements of interactive video search systems.
During the last three years, the most successful systems at the Video Browser Showdown employed effective retrieval models where raw video data are automatically preprocessed in advance to extract semantic or low-level features of selected frames or shots. This enables users to express their search intents in the form of keywords, sketch, query example, or their combination. In this paper, we present new extensions to our interactive video retrieval system VIRET that won Video Browser Showdown in 2018 and achieved the second place at Video Browser Showdown 2019 and Lifelog Search Challenge 2019. The new features of the system focus both on updates of retrieval models and interface modifications to help users with query specification by means of informative visualizations.
In this paper, we present the features implemented in the 4th version of the VIREO Video Search System (VIREO-VSS). In this version, we propose a sketch-based retrieval model, which allows the user to specify a video scene with objects and their basic properties, including color, size, and location. We further utilize the temporal relation between video frames to strengthen this retrieval model. For the text-based retrieval module, we supply speech and on-screen text for free-text search and upgrade the concept bank for concept search. The search interface is also re-designed targeting the novice user. With the introduced system, we expect that the VIREO-VSS can be a competitive participant in the Video Browser Showdown (VBS) 2020.
Since the 1970s, Content-Based Image Indexing and Retrieval (CBIR) has been an active research area. Nowadays, the rapid increase of video data has paved the way for advances in many different communities toward Content-Based Video Indexing and Retrieval (CBVIR). However, greater attention needs to be devoted to the development of effective tools for video search and browsing. In this paper, we present VISIONE, a system for large-scale video retrieval. The system integrates several content-based analysis and retrieval modules, including keyword search, spatial object-based search, and visual similarity search. In tests where users needed to find as many correct examples as possible, the similarity search proved to be the most promising option. Our implementation is based on state-of-the-art deep learning approaches for content analysis and leverages highly efficient indexing techniques to ensure scalability. Specifically, we encode all the visual and textual descriptors extracted from the videos into (surrogate) textual representations that are then efficiently indexed and searched using an off-the-shelf text search engine and its similarity functions.
We present an image search engine that allows similarity search over the roughly 100M images included in the YFCC100M dataset and annotates query images. Image similarity search is performed using YFCC100M-HNfc6, the set of deep features we extracted from the YFCC100M dataset, which was indexed using the MI-File index for efficient similarity searching. A metadata cleaning algorithm, using visual and textual analysis, was employed to select from the YFCC100M dataset a relevant subset of images and associated annotations, creating a training set for automatic textual annotation of submitted queries. The online image search and annotation system demonstrates the effectiveness of the deep features for assessing conceptual similarity among images, the effectiveness of the metadata cleaning algorithm in identifying a relevant training set for annotation, and the efficiency and accuracy of the MI-File similarity index techniques for searching and annotating a dataset of 100M images with very limited computing resources.
While deep learning has become a key ingredient in the top performing methods for many computer vision tasks, it has failed so far to bring similar improvements to instance-level image retrieval. In this article, we argue that reasons for the underwhelming results of deep methods on image retrieval are threefold: i) noisy training data, ii) inappropriate deep architecture, and iii) suboptimal training procedure. We address all three issues. First, we leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval. Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it. Last, we train this network with a siamese architecture that combines three streams with a triplet loss. At the end of the training process, the proposed architecture produces a global image representation in a single forward pass that is well suited for image retrieval. Extensive experiments show that our approach significantly outperforms previous retrieval approaches, including state-of-the-art methods based on costly local descriptor indexing and spatial verification. On Oxford 5k, Paris 6k and Holidays, we respectively report 94.7, 96.6, and 94.8 mean average precision. Our representations can also be heavily compressed using product quantization with little loss in accuracy. For additional material, please see www.xrce.xerox.com/Deep-Image-Retrieval.
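The triplet objective driving this training can be sketched for a single triplet (a minimal illustration; the paper's three-stream siamese network, R-MAC pooling, and hard-triplet mining are not reproduced here):

```python
import numpy as np

# Minimal sketch of the triplet loss for retrieval embeddings: pull the anchor
# toward a matching image and push it away from a non-matching one by at least
# a margin. (Illustrative only; not the paper's full training procedure.)

def triplet_loss(anchor, positive, negative, margin=0.1):
    """L = max(0, margin + ||a - p||^2 - ||a - n||^2) on L2-normalized vectors."""
    a, p, n = (v / np.linalg.norm(v) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return max(0.0, margin + d_pos - d_neg)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # same landmark: close to the anchor
n = np.array([0.0, 1.0])    # different landmark: far from the anchor
loss = triplet_loss(a, p, n)   # well-separated triplet, so the loss is zero
```

Only triplets that violate the margin produce gradients, which is why the choice of hard triplets matters so much during training.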
The activations of the hidden layers of Deep Convolutional Neural Networks can be successfully used as features, often referred to as Deep Features, in generic visual similarity search tasks.
Recently, scientists have shown that permutation-based methods offer very good performance in indexing and supporting approximate similarity search on large databases of objects. Permutation-based approaches represent metric objects as sequences (permutations) of reference objects, chosen from a predefined set of data. However, associating objects with permutations might have a high cost due to the distance calculations between the data objects and the reference objects.
In this work, we propose a new approach to generate permutations at a very low computational cost, when objects to be indexed are Deep Features. We show that the permutations generated using the proposed method are more effective than those obtained using pivot selection criteria specifically developed for permutation-based methods.
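The core idea can be sketched as deriving a permutation from the feature's own activation ranking rather than from distances to reference objects (an illustrative simplification over assumed toy feature vectors; this is not the paper's exact construction):

```python
import numpy as np

# Sketch of permutation generation directly from a deep feature: the permutation
# is the ranking of feature dimensions by activation value, so no distances to
# reference objects need to be computed. Permutations are compared with the
# Spearman footrule distance. (Illustrative simplification.)

def feature_to_permutation(feature):
    """Return rank positions of feature dimensions, largest activation first."""
    order = np.argsort(-np.asarray(feature))   # dims sorted by activation
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(len(order))       # ranks[d] = rank of dimension d
    return ranks

def footrule(r1, r2):
    """Spearman footrule distance between two rank vectors."""
    return int(np.sum(np.abs(r1 - r2)))

f1 = [0.9, 0.1, 0.5, 0.0]
f2 = [0.8, 0.2, 0.6, 0.1]   # similar activations -> similar permutation
f3 = [0.0, 0.9, 0.1, 0.8]   # different activations -> distant permutation
d_close = footrule(feature_to_permutation(f1), feature_to_permutation(f2))
d_far = footrule(feature_to_permutation(f1), feature_to_permutation(f3))
```

Because the permutation falls directly out of a single argsort of the feature, the indexing cost per object drops from many distance computations to one sort.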
Interactive video retrieval tools developed over the past few years are emerging as powerful alternatives to automatic retrieval approaches, giving the user more control as well as more responsibility. Current research tries to identify the best combinations of image, audio and text features that, combined with innovative UI design, maximize the tools' performance. We present the latest installment of the Video Browser Showdown, held in 2015 in conjunction with the International Conference on MultiMedia Modeling (MMM 2015), which has the stated aim of pushing for a better integration of the user into the search process. The setup of the competition, including the dataset used and the presented tasks, as well as the participating tools, will be introduced. The performance of those tools will be thoroughly presented and analyzed. Interesting highlights will be marked and some predictions regarding the research focus within the field for the near future will be made.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs through CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approximately 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment, from prototyping machines to cloud environments.
Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Known-item search in multimodal lifelog data represents a challenging task for present search engines. Since sequences of temporally close images represent a significant part of the provided data, an interactive video retrieval tool with a few extensions can be confronted with the known-item search tasks at the Lifelog Search Challenge. We present an update of the SIRET interactive video retrieval tool that recently won the Video Browser Showdown 2018. As the tool relies on frame-based representations and retrieval models, it can be directly used also for images from lifelog cameras. The updates comprise mostly visualization and navigation methods for a high number of visually similar scenes representing repetitive daily activities.
The last decade has seen innovations that make video recording, manipulation, storage and sharing easier than ever before, thus impacting many areas of life. New video retrieval scenarios emerged as well, which challenge the state-of-the-art video retrieval approaches. Despite recent advances in content analysis, video retrieval can still benefit from involving the human user in the loop. We present our experience with a class of interactive video retrieval scenarios and our methodology to stimulate the evolution of new interactive video retrieval approaches. More specifically, the Video Browser Showdown evaluation campaign is thoroughly analyzed, focusing on the years 2015-2017. Evaluation scenarios, objectives and metrics are presented, complemented by the results of the annual evaluations. The results reveal promising interactive video retrieval techniques adopted by the most successful tools and confirm assumptions about the different complexity of various types of interactive retrieval scenarios. A comparison of the interactive retrieval tools with automatic approaches (including fully automatic and manual query formulation) participating in the TRECVID 2016 Ad-hoc Video Search (AVS) task is discussed. Finally, based on the results of data analysis, a substantial revision of the evaluation methodology for the following years of the Video Browser Showdown is provided.
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database along with the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.
Semantic indexing, or assigning semantic tags to video samples, is a key component for content-based access to video documents and collections. The Semantic Indexing task was run at TRECVid from 2010 to 2015 with the support of NIST and the Quaero project. As with the previous High-Level Feature detection task, which ran from 2002 to 2009, the semantic indexing task aims at evaluating methods and systems for detecting visual, auditory or multi-modal concepts in video shots. In addition to the main semantic indexing task, four secondary tasks were proposed, namely the “localization” task, the “concept pair” task, the “no annotation” task, and the “progress” task. It attracted over 40 research teams during its running period. The task was conducted using a total of 1,400 hours of video data drawn from Internet Archive videos with Creative Commons licenses gathered by NIST. 200 hours of new test data was made available each year, plus 200 more as development data in 2010. The number of target concepts to be detected started from 130 in 2010 and was extended to 346 in 2011. Both the increase in the volume of video data and in the number of target concepts favored the development of generic and scalable methods. Over 8 million direct shot×concept annotations, plus over 20 million indirect ones, were produced by the participants and the Quaero project on a total of 800 hours of development data. Significant progress was accomplished during the period, as accurately measured in the context of the progress task but also in some of the participants' contrast experiments. This paper describes the data, protocol and metrics used for the main and the secondary tasks, the results obtained and the main approaches used by participants.
We created the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) in 2014 as part of the Yahoo Webscope program, which is a reference library of interesting and scientifically useful datasets. The YFCC100M is the largest public multimedia collection ever released, with a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all uploaded to Flickr between 2004 and 2014 and published under a CC commercial or noncommercial license. The dataset is distributed through Amazon Web Services as a 12.5GB compressed archive containing only metadata. However, as with many datasets, the YFCC100M is constantly evolving; over time, we have released and will continue to release various expansion packs containing data not yet in the collection; for instance, the actual photos and videos, as well as several visual and aural features extracted from the data, have already been uploaded to the cloud, ensuring the dataset remains accessible and intact for years to come. The YFCC100M dataset overcomes many of the issues affecting existing multimedia datasets in terms of modalities, metadata, licensing, and, principally, volume.
In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Although extensive efforts have been devoted in recent years, most existing works combined multiple video features using simple fusion strategies and neglected the utilization of inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits the feature relationships and the class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by rigorously imposing regularizations in the learning process of a deep neural network (DNN). Such a regularized DNN (rDNN) can be efficiently realized using a GPU-based implementation with an affordable training cost. By arming the DNN with a better capability of harnessing both the feature and the class relationships, the proposed rDNN is more suitable for modeling video semantics. With extensive experimental evaluations, we show that rDNN produces superior performance over several state-of-the-art approaches. On the well-known Hollywood2 and Columbia Consumer Video benchmarks, we obtain very competitive results: 66.9% and 73.5%, respectively, in terms of mean average precision. In addition, to substantially evaluate our rDNN and stimulate future research on large-scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories.
Content-based image retrieval is becoming a popular way for searching digital libraries as the amount of available multimedia data increases. However, the cost of developing from scratch a robust and reliable system with content-based image retrieval facilities for large databases is quite prohibitive.
In this paper, we propose to exploit an approach to perform approximate similarity search in metric spaces developed by [3,6]. The idea at the basis of these techniques is that when two objects are very close to each other, they 'see' the world around them in the same way. Accordingly, we can use a measure of dissimilarity between the views of the world at different objects in place of the distance function of the underlying metric space. To employ this idea, the low-level image features (such as colors and textures) are converted into a textual form and indexed in an inverted index by means of the Lucene search engine library. The conversion of the features into textual form allows us to employ Lucene's off-the-shelf indexing and searching abilities with little implementation effort. In this way, we are able to set up a robust information retrieval system that combines full-text search with content-based image retrieval capabilities.
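The feature-to-text conversion can be sketched as follows (an illustrative toy version; the `refN` token scheme and the repetition-based weighting are assumptions, not the paper's exact surrogate encoding): each object is described by its closest reference objects, and each reference identifier is emitted as a text term repeated more times the higher it ranks, so a text engine's term-frequency scoring approximates the permutation-based similarity.

```python
# Sketch of the "views of the world" encoding (token scheme hypothetical):
# represent each feature by its k nearest reference objects and emit each
# reference id as a term repeated more often the closer it ranks, so an
# off-the-shelf full-text engine can score similarity via term frequencies.

def to_surrogate_text(feature, references, k=3):
    """Rank reference objects by distance to `feature`; emit weighted terms."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(range(len(references)), key=lambda i: dist(feature, references[i]))
    terms = []
    for rank, ref_id in enumerate(ranked[:k]):
        terms += [f"ref{ref_id}"] * (k - rank)   # closer reference -> more repeats
    return " ".join(terms)

refs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
text = to_surrogate_text((0.1, 0.1), refs)
```

Both the indexed images and the query go through the same conversion, after which ranking reduces to an ordinary full-text search over the surrogate documents.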