Article

Interactive Search or Sequential Browsing? A Detailed Analysis of the Video Browser Showdown 2018

Abstract

This work summarizes the findings of the 7th iteration of the Video Browser Showdown (VBS) competition organized as a workshop at the 24th International Conference on Multimedia Modeling in Bangkok. The competition focuses on video retrieval scenarios in which the searched scenes were either previously observed or described by another person (i.e., an example shot is not available). During the event, nine teams competed with their video retrieval tools in providing access to a shared video collection with 600 hours of video content. Evaluation objectives, rules, scoring, tasks, and all participating tools are described in the article. In addition, we provide some insights into how the different teams interacted with their video browsers, which was made possible by a novel interaction logging mechanism introduced for this iteration of the VBS. The results collected at the VBS evaluation server confirm that searching for one particular scene in the collection when given a limited time is still a challenging task for many of the approaches that were showcased during the event. Given only a short textual description, finding the correct scene is even harder. In ad hoc search with multiple relevant scenes, the tools were mostly able to find at least one scene, whereas recall was the issue for many teams. The logs also reveal that even though recent exciting advances in machine learning narrow the classical semantic gap problem, user-centric interfaces are still required to mediate access to specific content. Finally, open challenges and lessons learned are presented for future VBS events.


... The Video Browser Showdown (VBS) [7]-[9], first held in 2012, is an annual video search evaluation campaign that employs a competitive format, which allows participating teams to evaluate their state-of-the-art interactive video retrieval systems in direct comparison to one another. It provides a fair and live performance assessment of retrieval systems for the same search tasks, on the same dataset, in the same environment. ...
... The task selection process for VBS 2019 followed the same procedure as in earlier years [9]. Table I lists the textual KIS tasks used during the private session with the expert users. ...
... This section presents the overall results of the competition as well as a discussion on the setting under which they were produced. [Figure 2: Time elapsed until the first correct submission per team for all scoring teams at VBS 2019.] The tasks for novices were the same as for experts, except for their order. ...
Article
Despite the fact that automatic content analysis has made remarkable progress over the last decade - mainly due to significant advances in machine learning - interactive video retrieval is still a very challenging problem, with an increasing relevance in practical applications. The Video Browser Showdown (VBS) is an annual evaluation competition that pushes the limits of interactive video retrieval with state-of-the-art tools, tasks, data, and evaluation metrics. In this paper, we analyse the results and outcome of the 8th iteration of the VBS in detail. We first give an overview of the novel and considerably larger V3C1 dataset and the tasks that were performed during VBS 2019. We then go on to describe the search systems of the six international teams in terms of features and performance. And finally, we perform an in-depth analysis of the per-team success ratio and relate this to the search strategies that were applied, the most popular features, and problems that were experienced. A large part of this analysis was conducted based on logs that were collected during the competition itself. This analysis gives further insights into the typical search behavior and differences between expert and novice users. Our evaluation shows that textual search and content browsing are the most important aspects in terms of logged user interactions. Furthermore, we observe a trend towards deep learning based features, especially in the form of labels generated by artificial neural networks. But nevertheless, for some tasks, very specific content-based search features are still being used. We expect these findings to contribute to future improvements of interactive video search systems.
... Interactive Retrieval based on relevance feedback has been an active area of systems research and used by participants of evaluation campaigns such as the Video Browser Showdown (VBS) in the past [7][8][9]. Two prominent examples include the winner of the 2020 VBS campaign SOMHunter [6] and Exquisitor [5]. ...
... The reason for reporting on videos instead of segments is that vitrivr tends to over-segment, and therefore users often submit segments very close to one another, which makes numbers of uniquely submitted segments less meaningful. As already discussed in analyses of AVS tasks at VBS [8], measuring the actual recall is not possible due to the lack of a ground truth. Figure 4 shows all metrics per task and team. ...
Chapter
Full-text available
vitrivr is an open-source system for indexing and retrieving multimedia data based on its content and it has been a fixture at the Video Browser Showdown (VBS) in the past years. While vitrivr has proven to be competitive in content-based retrieval due to the many different query modes it supports, its functionality is rather limited when it comes to exploring a collection or searching result sets based on content. In this paper, we present vitrivr-explore, an extension to the vitrivr stack that allows to explore multimedia collections using relevance feedback. For this, our implementation integrates into the existing features of vitrivr and exploits self-organizing maps. Users initialize the exploration by either starting with a query or just picking examples from a collection while browsing. Exploration can be based on a mixture of semantic and visual features. We describe our architecture and implementation and present first results of the effectiveness of vitrivr-explore in a VBS-like evaluation. These results show that vitrivr-explore is competitive for Ad-hoc Video Search (AVS) tasks, even without user initialization.
... Co-located with the annual International Conference on Multimedia Retrieval (ICMR), the Lifelog Search Challenge (LSC) [9] is a platform for testing and comparing new systems that try to tackle the problem of data management and information retrieval in lifelog collections. It offers a competitive setting, similar to the one found during the Video Browser Showdown (VBS) [18,19], in which these tools try to solve the given tasks within a set time limit. The faster a tool succeeds in solving a task, the higher the score it receives. ...
... vitrivr is no stranger to multimedia retrieval competitions, being a long-running participant of the Video Browser Showdown (VBS) [2,19] and winner of the competition in 2017 and 2019 [31] as well as of last year's installment of the LSC [25]. ...
Conference Paper
Full-text available
The variety and amount of data being collected in our everyday life poses unique challenges for multimedia retrieval. In the Lifelog Search Challenge (LSC), multimedia retrieval systems compete in finding events based on descriptions containing hints about structured, semi-structured and unstructured data. In this paper, we present the multimedia retrieval system vitrivr with a focus on the changes and additions made based on the new dataset, and our successful participation at LSC 2019. Specifically, we show how the new dataset can be used for retrieval in different modalities without sacrificing efficiency, describe two recent additions, temporal scoring and staged querying, and discuss the deep learning methods used to enrich the dataset.
... known-item search system [8,9]. However, many research teams in the multimedia retrieval community focus just on a specific category of tasks, and find it too time-demanding to include their developed model in a more complex interactive video search framework. ...
... The latter case may be especially useful with the VBS-standard-compliant logger and submission component. We believe that sharing the software will help the new teams with joining interactive search evaluation campaigns (such as VBS [9] and LSC [3]), and provide a solid basis for construction of new systems and development of new approaches to video retrieval. ...
Conference Paper
Full-text available
In the last decade, the Video Browser Showdown (VBS) became a comparative platform for various interactive video search tools competing in selected video retrieval tasks. However, the participation of new teams with an own, novel tool is prohibitively time-demanding because of the large number and complexity of components required for constructing a video search system from scratch. To partially alleviate this difficulty, we provide an open-source version of the lightweight known-item search system SOMHunter that competed successfully at VBS 2020. The system combines several features for text-based search initialization and browsing of large result sets; in particular a variant of W2VV++ model for text search, temporal queries for targeting sequences of frames, several types of displays including the eponymous self-organizing map view, and a feedback-based approach for maintaining the relevance scores inspired by PICHunter. The minimalistic, easily extensible implementation of SOMHunter should serve as a solid basis for constructing new search systems, thus facilitating easier exploration of new video retrieval ideas.
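As a rough illustration of the feedback-based relevance maintenance mentioned above (inspired by PICHunter), the following minimal Python sketch multiplicatively boosts the scores of frames similar to examples the user marked as relevant. It is not SOMHunter code; all names, the temperature parameter, and the use of cosine similarity are assumptions for illustration only.

```python
import numpy as np

def update_relevance(scores, features, liked_idx, temperature=0.1):
    """Multiplicative relevance update inspired by PICHunter-style feedback.

    scores   : (N,) current relevance scores over all frames
    features : (N, D) L2-normalised frame features
    liked_idx: indices of frames the user marked as relevant on the display
    Returns re-normalised scores; frames similar to liked ones gain weight.
    """
    for i in liked_idx:
        # cosine similarity of every frame to the liked frame
        sim = features @ features[i]
        # frames close to the liked example are considered more likely targets
        scores = scores * np.exp(sim / temperature)
        scores = scores / scores.sum()   # keep a proper distribution
    return scores

# toy usage with random data (hypothetical feature dimensionality)
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
scores = np.full(1000, 1.0 / 1000)
scores = update_relevance(scores, feats, liked_idx=[3, 42])
print(scores.argsort()[-5:])  # current top candidates
```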
... Several video retrieval systems participated at the VBS in the last years [1,3,23,24]. Most of them, including our system, support multimodal search with interactive query formulation. ...
Article
Full-text available
This paper describes in detail VISIONE, a video search system that allows users to search for videos using textual keywords, the occurrence of objects and their spatial relationships, the occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined together to express complex queries and meet users' needs. The peculiarity of our approach is that we encode all information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding that is indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) need to be merged. In addition, we report an extensive analysis of the retrieval performance of the system, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies from those we tested.
... Queries by color sketch were used sparingly since they proved to be less stable and sometimes degraded the quality of results obtained with the keyword/object search. As shown in Figure 3, the textual KIS task was the hardest for our system, in line with the observation made by the organizers of the competition in [12], who note that the textual KIS task is much harder to solve than visual tasks. ...
Chapter
Full-text available
Since the 1970s, Content-Based Image Indexing and Retrieval (CBIR) has been an active research area. Nowadays, the rapid increase of video data has paved the way to the advancement of technologies in many different communities for the creation of Content-Based Video Indexing and Retrieval (CBVIR). However, greater attention needs to be devoted to the development of effective tools for video search and browsing. In this paper, we present Visione, a system for large-scale video retrieval. The system integrates several content-based analysis and retrieval modules, including keyword search, spatial object-based search, and visual similarity search. In the tests carried out by users who needed to find as many correct examples as possible, the similarity search proved to be the most promising option. Our implementation is based on state-of-the-art deep learning approaches for content analysis and leverages highly efficient indexing techniques to ensure scalability. Specifically, we encode all the visual and textual descriptors extracted from the videos into (surrogate) textual representations that are then efficiently indexed and searched with an off-the-shelf text search engine using similarity functions.
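Both VISIONE abstracts describe encoding deep features, tags, and object/color locations as surrogate textual representations indexed by an off-the-shelf text engine. The sketch below shows one generic surrogate-text trick: keep the largest feature components and repeat a token per component proportionally to its magnitude, so that term-frequency scoring approximates a dot product. It is an assumption about the general technique, not VISIONE's actual encoding.

```python
import numpy as np

def feature_to_surrogate_text(vec, n_keep=32, max_repeat=10):
    """Encode a deep feature vector as a space-separated 'document'.

    Keeps the n_keep largest components and repeats a token 'fD'
    proportionally to the component magnitude, so a standard text
    engine's term-frequency scoring approximates the dot product.
    This is a generic surrogate-text scheme, not VISIONE's exact one.
    """
    vec = np.maximum(vec, 0)                      # ReLU-like features assumed
    top = np.argsort(vec)[::-1][:n_keep]
    tokens = []
    for d in top:
        reps = int(round(max_repeat * vec[d] / (vec[top[0]] + 1e-9)))
        tokens.extend([f"f{d}"] * max(reps, 1))
    return " ".join(tokens)

doc = feature_to_surrogate_text(np.random.rand(2048))
print(doc[:80], "...")
```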
... Both successful browsing systems support also query initialization (keywords, sketches) to initialize the search. For an overview of other related tools, we refer the readers to recent Video Browser Showdown (VBS) summary papers [12,25,26]. ...
Conference Paper
Searching for one particular scene in a large video collection (known-item search) represents a challenging task for video retrieval systems. According to the recent results reached at evaluation campaigns, even respected approaches based on machine learning do not help to solve the task easily in many cases. Hence, in addition to effective automatic multimedia annotation and embedding, interactive search is recommended as well. This paper presents a comprehensive description of an interactive video retrieval framework VIRET that successfully participated at several recent evaluation campaigns. The utilized video analysis, feature extraction and retrieval models are detailed, as well as several experiments evaluating the effectiveness of selected system components. The results of the prototype at the Video Browser Showdown 2019 are highlighted in connection with an analysis of collected query logs. We conclude that the framework comprises a set of effective and efficient models for most of the evaluated known-item search tasks in 1000 hours of video and could serve as a baseline reference approach. The analysis also reveals that the result presentation interface needs improvements for better performance of future VIRET prototypes.
... An example of such task is known-item search (KIS), where users search for one particular "known" scene in a given annotation-free multimedia collection [6]. However, even though the known-item search systems can benefit from ranking models based on more effective automatic annotations and various other visual modalities, interactive searching is still required to aid the ranking model in order to target the searched item precisely [7]. For example, the VIRET system [8] provides multi-modal and temporal query formulation interface, and allows the user to iteratively update the query with new terms derived from observation of the candidate result set. ...
Chapter
Full-text available
This paper presents a prototype video retrieval engine focusing on a simple known-item search workflow, where users initialize the search with a query and then use an iterative approach to explore a larger candidate set. Specifically, users gradually observe a sequence of displays and provide feedback to the system. The displays are dynamically created by a self-organizing map that employs the scores based on the collected feedback, in order to provide a display matching the user preferences. In addition, users can inspect various other types of specialized displays for exploitation purposes, once promising candidates are found.
... In this paper, we present the recent improvements made to vitrivr [18], our multimedia retrieval stack capable of processing several different types of media documents [3]. vitrivr (and its predecessor, the IMOTION system [16]) has participated in the Video Browser Showdown (VBS) [9] for several years [17] and recently also made its debut [13] at the Lifelog Search Challenge (LSC) 2019 [6]. Throughout its development history, vitrivr has gained a large amount of content-based retrieval related functionality. ...
Chapter
This paper presents the most recent additions to the vitrivr multimedia retrieval stack made in preparation for the participation in the 9th Video Browser Showdown (VBS) in 2020. In addition to refining existing functionality and adding support for classical Boolean queries and metadata filters, we also completely replaced our storage engine with a new database called Cottontail DB. Furthermore, we have added support for scoring based on the temporal ordering of multiple video segments with respect to a query formulated by the user. Finally, we have also added a new object detection module based on Faster-RCNN and use the generated features for object instance search.
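For the temporal ordering score mentioned above, one simple (hypothetical) way to combine per-segment scores of two staged sub-queries is to reward pairs of segments that occur in the stated order within a time window. The sketch below is not vitrivr's implementation; the product fusion and the max_gap window are assumptions.

```python
def temporal_score(seg_scores_a, seg_scores_b, times, max_gap=30.0):
    """Score segment pairs (i, j) where j follows i within max_gap seconds.

    seg_scores_a / seg_scores_b : dicts segment_id -> score for the first
    and second part of a temporal query; times maps segment_id -> start time.
    Returns the best combined score per first-stage segment.
    """
    best = {}
    for i, sa in seg_scores_a.items():
        for j, sb in seg_scores_b.items():
            gap = times[j] - times[i]
            if 0 < gap <= max_gap:
                combined = sa * sb           # simple product fusion
                if combined > best.get(i, 0.0):
                    best[i] = combined
    return best

scores = temporal_score({"s1": 0.9, "s2": 0.4},
                        {"s3": 0.8, "s4": 0.7},
                        {"s1": 10.0, "s2": 50.0, "s3": 25.0, "s4": 200.0})
print(scores)   # {'s1': 0.72}
```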
... Ultimately however, each approach aims to support the user in quickly identifying promising scenes before watching them back to make judgements about their relevance. Though interactive systems typically take longer than fully-automatic systems to achieve equivalent levels of recall, they can be used to ensure high precision results (Lokoč et al., 2019). ...
Article
Patient video taken at home can provide valuable insights into the recovery progress during a programme of physical therapy, but is very time consuming for clinician review. Our work focussed on (i) enabling any patient to share information about progress at home, simply by sharing video and (ii) building intelligent systems to support Physical Therapists (PTs) in reviewing this video data and extracting the necessary detail. This paper reports the development of the system, appropriate for future clinical use without reliance on a technical team, and the clinician involvement in that development. We contribute an interactive content-based video retrieval system that significantly reduces the time taken for clinicians to review videos, using human head movement as an example. The system supports query-by-movement (clinicians move their own body to define search queries) and retrieves the essential fine-grained movements needed for clinical interpretation. This is done by comparing sequences of image-based pose estimates (here head rotations) through a distance metric (here Fréchet distance) and presenting a ranked list of similar movements to clinicians for review. In contrast to existing intelligent systems for retrospective review of human movement, the system supports a flexible analysis where clinicians can look for any movement that interests them. Evaluation by a group of PTs with expertise in training movement control showed that 96% of all relevant movements were identified with time savings of as much as 99.1% compared to reviewing target videos in full. The novelty of this contribution includes retrospective progress monitoring that preserves context through video, and content-based video retrieval that supports both fine-grained human actions and query-by-movement. Future research, including large clinician-led studies, will refine the technical aspects and explore the benefits in terms of patient outcomes, PT time, and financial savings over the course of a programme of therapy. It is anticipated that this clinician-led approach will mitigate the reported slow clinical uptake of technology with resulting patient benefit.
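The retrieval step described above (comparing sequences of head-rotation estimates with the Fréchet distance and returning a ranked list) can be illustrated with a small, self-contained sketch. The discrete Fréchet distance over 1-D yaw-angle sequences below is a generic textbook formulation, not the authors' implementation, and the segment names are made up.

```python
from functools import lru_cache

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two sequences of head-yaw angles."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = abs(p[i] - q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(p) - 1, len(q) - 1)

def rank_segments(query, segments):
    """Return segment ids sorted by similarity (smaller distance = better)."""
    return sorted(segments, key=lambda sid: discrete_frechet(tuple(query),
                                                             tuple(segments[sid])))

query = [0, 10, 25, 40, 25, 10, 0]               # clinician's demonstrated movement
segments = {"clip_a": [1, 12, 27, 38, 24, 8, 1], # similar head turn
            "clip_b": [0, 0, 2, 1, 0, 0, 0]}     # almost no movement
print(rank_segments(query, segments))            # ['clip_a', 'clip_b']
```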
... One notable aspect of these systems was the heavy reliance on visual-based retrieval, which utilised computer vision technologies such as concept-detection and image similarity to aid the retrieval process. Indeed, two of the systems [13,15] were directly ported from existing video retrieval engines utilised at the VBS (Video Browser Showdown) [12], a similar activity to the LSC, though aimed at interactive video search. ...
Conference Paper
Full-text available
In this paper we present a first version of the LifeSeeker interactive lifelog retrieval engine that is under development at Dublin City University. This retrieval engine has been designed as a platform onto which future lifelog annotation and retrieval engines will be built. The first implementation of LifeSeeker has been designed for the LSC'19 comparative benchmarking challenge and it takes the form of a faceted search and browsing interface with the addition of query expansion to help solve the lexical-gap between novice users and the concept annotation tools employed for annotating the collection.
... Despite all these efforts, however, it has been shown repeatedly [7,9] that the task of finding a particular item in a large enough collection still is an interactive task that requires cooperation between a human actor and a system. This results in the more general setting of interactive retrieval, in which users leverage end-to-end retrieval systems to explore media collections and to satisfy a particular information need, by refining queries and browsing through result sets. ...
Conference Paper
The evaluation of the performance of interactive multimedia retrieval systems is a methodologically non-trivial endeavour and requires specialized infrastructure. Current evaluation campaigns have so far relied on a local setting, where all retrieval systems needed to be evaluated at the same physical location at the same time. This constraint does not only complicate the organization and coordination but also limits the number of systems which can reasonably be evaluated within a set time frame. Travel restrictions might further limit the possibility for such evaluations. To address these problems, evaluations need to be conducted in a (geographically) distributed setting, which was so far not possible due to the lack of supporting infrastructure. In this paper, we present the Distributed Retrieval Evaluation Server (DRES), an open-source evaluation system to facilitate evaluation campaigns for interactive multimedia retrieval systems in both traditional on-site as well as fully distributed settings which has already proven effective in a competitive evaluation.
... One of the main takeaways of these two iterations is that an effective user interface for result exploration is at least as important as an expressive retrieval model. This is also a lesson learned from analyses of the Video Browser Showdown (VBS) [18,22], where vitrivr is an active participant [13,27,28]. vitrivr has experimented with exploratory user interfaces in the past [15]. ...
... Still, the most effective way for such retrieval evaluations comes in the form of interactive campaigns, which comparatively evaluate retrieval systems using common settings and standardized infrastructure [21] on realistic retrieval tasks. Modelled after the Video Browser Showdown (VBS) [16,20,27], the Lifelog Search Challenge (LSC) [6][7][8][18] is such a campaign, and has established itself as a driving force behind the advances in lifelog retrieval. ...
... The annual Lifelog Search Challenge (LSC) [4,5,7] addresses these issues by providing international teams of researchers with a platform for developing strategies aiding the improvement of lifelog data collection search and retrieval. Inspired by another long-standing popular multimedia retrieval competition, the Video Browser Showdown (VBS) [16,17,22], the LSC is held as a live event, where participating teams are tasked with solving several time-constrained search assignments using their custom-developed systems on a lifelogging database [4,6] that spans approximately 4 months of anonymized data. This data is available to the participants during the development phase, hence, it can be used for pre-processing according to the individual teams' preferences and requirements. ...
Chapter
Having participated in the three most recent iterations of the annual Video Browser Showdown (VBS2017–VBS2019) as well as in both newly established Lifelog Search Challenges (LSC2018–LSC2019), the actively developed Deep Interactive Video Exploration (diveXplore) system combines a variety of content-based video analysis and processing strategies for interactively exploring large video archives. The system provides a user with browseable self-organizing feature maps, color filtering, semantic concept search utilizing deep neural networks as well as hand-drawn sketch search. The most recent version improves upon its predecessors by unifying deep concepts for facilitating and speeding up search, while significantly refactoring the user interface for increasing the overall system performance.
Chapter
This paper demonstrates VERGE, an interactive video retrieval engine for browsing a collection of images or videos and searching for specific content. The engine integrates a multitude of retrieval methodologies that include visual and textual searches and further capabilities such as fusion and reranking. All search options and results appear in a web application that aims at a friendly user experience.
Chapter
This paper introduces a video retrieval tool for the 2020 Video Browser Showdown (VBS). The tool enhances the user's video browsing experience by making full use of a video analysis database constructed prior to the Showdown. Deep learning based object detection, scene text detection, scene color detection, audio classification and relation detection with scene graph generation methods have been used to construct the data. The data is composed of visual, textual, and auditory information, broadening the scope of what a user can search for beyond visual information. In addition, the tool provides a simple and user-friendly interface so that novice users can adapt to the tool in little time.
Chapter
During the last three years, the most successful systems at the Video Browser Showdown employed effective retrieval models where raw video data are automatically preprocessed in advance to extract semantic or low-level features of selected frames or shots. This enables users to express their search intents in the form of keywords, sketch, query example, or their combination. In this paper, we present new extensions to our interactive video retrieval system VIRET that won Video Browser Showdown in 2018 and achieved the second place at Video Browser Showdown 2019 and Lifelog Search Challenge 2019. The new features of the system focus both on updates of retrieval models and interface modifications to help users with query specification by means of informative visualizations.
Chapter
The previous version of our retrieval system has shown significant results in retrieval tasks such as the Lifelog moment retrieval tasks. In this paper, we adapt our platform to the Video Browser Showdown's KIS and AVS tasks and present how our system performs in video search tasks. In addition to the smart features in our retrieval system that take advantage of the provided analysis data, we enhance the data with object color detection by employing Mask R-CNN and clustering. In this version of our search system, we try to extract the location information of the entities appearing in the videos and aim to exploit the spatial relationships between these entities. We also focus on designing efficient user interaction and a high-performance way to transfer data in the system to minimize the retrieval time.
Chapter
This paper presents a new video retrieval tool, the Interactive VIdeo Search Tool (IVIST), which participates in the 2020 Video Browser Showdown (VBS). As a video retrieval tool, IVIST is equipped with suitable and high-performing functionalities such as object detection, dominant-color finding, scene-text recognition and text-image retrieval. These functionalities are built with various deep neural networks. By adopting them, IVIST performs well in finding the videos users are looking for. Furthermore, thanks to its user-friendly interface, IVIST is easy to use even for novice users. Although IVIST was developed to participate in VBS, we hope that it will be applied as a practical video retrieval tool in the future, dealing with actual video data on the Internet.
Chapter
In this paper, we present the features implemented in the 4th version of the VIREO Video Search System (VIREO-VSS). In this version, we propose a sketch-based retrieval model, which allows the user to specify a video scene with objects and their basic properties, including color, size, and location. We further utilize the temporal relation between video frames to strengthen this retrieval model. For the text-based retrieval module, we supply speech and on-screen text for free-text search and upgrade the concept bank for concept search. The search interface has also been re-designed, targeting the novice user. With the introduced system, we expect that the VIREO-VSS can be a competitive participant in the Video Browser Showdown (VBS) 2020.
Article
Large collections of digital video are increasingly accessible. The large volume and range of available video demands search tools that allow people to browse and query easily and to quickly make sense of the videos behind the result sets. This study focused on the usefulness of several multimedia surrogates, in terms of effectiveness, efficiency, and user satisfaction. Three surrogates were evaluated and compared: a storyboard, a 7-second segment, and a fast forward. Thirty-six experienced users of digital video conducted searches on each of four systems: three incorporated one of the surrogates each, and the fourth made all three surrogates available. Participants judged the relevance of at least 10 items for each search based on the surrogate(s) available, then re-judged the relevance of two of those items based on viewing the full video. Transaction logs and post-search and post-session questionnaires provided data on user interactions, including relevance judgments, and user perceptions. All of the surrogates provided a basis for accurate relevance judgments, though they varied (in expected ways) in terms of their efficiency. User perceptions favored the system with all three surrogates available, even though it took longer to use; they found it easier to learn and easier to use, and it gave them more confidence in their judgments. Based on these results, we can conclude that it's important for digital video collections to provide multiple surrogates, each providing a different view of the video.
Chapter
Searching for one particular scene in a large annotation-free video archive has become a common task in the multimedia age. Since the task is inherently difficult without knowledge of the scene location, multimedia management systems utilize various notions of similarity and provide both effective retrieval models and interactive interfaces. In this paper, we propose a vision of a simulation framework for the automatic configuration of interactive known-item search video retrieval systems. We believe that such a framework could help with early, resource-inexpensive evaluations and therefore with automatic parameter tuning, detection of effective search strategies and effective configuration of client prototypes.
Thesis
Full-text available
The usability of Mobile commerce (M-commerce) websites is a key parameter in determining the success of M-commerce businesses. Literature shows that numerous M-commerce websites have failed to attract customers due to the poor usability of user interfaces. In order to offer superior quality shopping experiences to consumers, it is thus essential to determine the appropriate attributes of successful user interfaces as well as the evaluation methods which should be employed to measure them. The available research resources consulted contained few references to usability evaluation, the identification of appropriate attributes as well as evaluation methods to be used for M-commerce applications. Consequently, the researcher proposes a new usability model for M-commerce websites to determine the suitability of attributes to be included in the proposed model for M-commerce websites. This research work aims to address the imbalance in literature by determining the appropriate attributes of the proposed usability model for usability evaluations of M-commerce applications. In an effort to validate the proposed usability model, an appropriate method to assess usability was formulated to evaluate existing M-commerce websites. The inappropriate application of usability methods will result in major usability problems which will, in turn, negatively impact users’ experiences. To facilitate improved M-commerce user experiences, this study set out to determine appropriate attributes of usability model as well as formulate a domain-specific usability evaluation method to ascertain the usability of said websites. The research work applied a combination of a user-based evaluation method and the proposed domain-specific evaluation method to evaluate the usability of four selected M-commerce websites. The outcomes of the study, which aided in the development of a framework for the usability evaluation of M-commerce websites, highlighted the effectiveness of the methods. Therefore, the proposed framework will prove useful to both new, and well-established M-commerce providers, as it will help guide usability professionals as to which evaluation method to choose for a specific usability problem area when evaluating the usability of M-commerce websites.
Chapter
We present IVOS, an interactive video content search system that allows for object-based search and filtering in video archives. The main idea is to use the results of recent object detection models to index all keyframes with a manageable set of object classes, and to allow the user to filter by different characteristics, such as object name, object location, relative object size, object color, and combinations thereof for different object classes – e.g., “large person in white on the left, with a red tie”. In addition to that, IVOS can also find segments with a specific number of objects of a particular class (e.g., “many apples” or “two people”) and supports similarity search, based on similar object occurrences.
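A hedged sketch of the kind of object-based filtering IVOS describes (object class, colour, relative size, and coarse location over pre-computed detections) is shown below. The field names and thresholds are assumptions for illustration; the actual IVOS index is not specified here.

```python
def matches(det, cls=None, color=None, side=None, min_rel_area=0.0):
    """Check one detection (dict with 'cls', 'color', 'box' in relative coords)."""
    x0, y0, x1, y1 = det["box"]                     # all in [0, 1]
    rel_area = (x1 - x0) * (y1 - y0)
    cx = (x0 + x1) / 2
    if cls and det["cls"] != cls:
        return False
    if color and det["color"] != color:
        return False
    if side == "left" and cx > 0.5:
        return False
    if side == "right" and cx < 0.5:
        return False
    return rel_area >= min_rel_area

def filter_keyframes(index, *predicates):
    """Keep keyframes where every predicate is satisfied by some detection."""
    return [kf for kf, dets in index.items()
            if all(any(matches(d, **p) for d in dets) for p in predicates)]

index = {"kf1": [{"cls": "person", "color": "white", "box": (0.05, 0.1, 0.45, 0.9)},
                 {"cls": "tie", "color": "red", "box": (0.2, 0.4, 0.25, 0.55)}],
         "kf2": [{"cls": "person", "color": "black", "box": (0.6, 0.1, 0.9, 0.9)}]}
print(filter_keyframes(index,
                       {"cls": "person", "color": "white", "side": "left",
                        "min_rel_area": 0.2},
                       {"cls": "tie", "color": "red"}))   # ['kf1']
```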
Chapter
Exquisitor is a scalable media exploration system based on interactive learning, which first took part in VBS in 2020. This paper presents an extension to Exquisitor, which supports operations on semantic classifiers to solve VBS tasks with temporal constraints. We outline the approach and present preliminary results, which indicate the potential of the approach.
Chapter
We present our NoShot Video Browser, which has been successfully used at the last Video Browser Showdown competition, VBS2020 at MMM2020. NoShot owes its name to the fact that it neither makes use of any kind of shot detection nor utilizes the VBS master shots. Instead, videos are split into frames at a time distance of one second. The biggest strength of the system lies in its feature “time cache”, which shows results with the best confidence within a range of seconds.
Chapter
As a longstanding participating system in the annual Video Browser Showdown (VBS2017-VBS2020) as well as in two iterations of the more recently established Lifelog Search Challenge (LSC2018-LSC2019), diveXplore is developed as a feature-rich Deep Interactive Video Exploration system. After its initial successful employment as a competitive tool at the challenges, its performance, however, declined as new features were introduced, increasing its overall complexity. We mainly attribute this to the fact that many additions to the system needed to revolve around the system's core element, an interactive self-organizing browseable feature map, which, as an integral component, did not accommodate the addition of new features well. Therefore, counteracting said performance decline, the VBS 2021 version constitutes a completely rebuilt version 5.0, implemented from scratch with the aim of greatly reducing the system's complexity as well as keeping proven useful features in a modular manner.
Chapter
Concept-free search, which embeds text and video signals in a joint space for retrieval, appears to be a new state-of-the-art. However, this new search paradigm suffers from two limitations. First, the search result is unpredictable and not interpretable. Second, the embedded features are in high-dimensional space hindering real-time indexing and search. In this paper, we present a new implementation of the Vireo video search system (Vireo-VSS), which employs a dual-task model to index each video segment with an embedding feature in a low dimension and a concept list for retrieval. The concept list serves as a reference to interpret its associated embedded feature. With these changes, a SQL-like querying interface is designed such that a user can specify the search content (subject, predicate, object) and constraint (logical condition) in a semi-structured way. The system will decompose the SQL-like query into multiple sub-queries depending on the constraint being specified. Each sub-query is translated into an embedding feature and a concept list for video retrieval. The search result is compiled by union or pruning of the search lists from multiple sub-queries. The SQL-like interface is also extended for temporal querying, by providing multiple SQL templates for users to specify the temporal evolution of a query.
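The abstract states that the search result is compiled by union or pruning of the search lists from multiple sub-queries. A minimal sketch of that final combination step over segment-id lists follows; it only illustrates the set logic and is not the Vireo-VSS code.

```python
def combine(sub_results, mode):
    """Combine per-sub-query result sets of segment ids.

    mode 'union' : any sub-query may match (logical OR)
    mode 'prune' : every sub-query must match (logical AND)
    """
    sets = [set(r) for r in sub_results]
    if mode == "union":
        out = set().union(*sets)
    else:
        out = set.intersection(*sets)
    return sorted(out)

# two hypothetical sub-queries obtained from a SQL-like query such as
#   SELECT segment WHERE (person, riding, bicycle) AND (outdoor)
print(combine([["s1", "s2", "s5"], ["s2", "s5", "s9"]], mode="prune"))  # ['s2', 's5']
```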
Chapter
This paper presents an enhanced version of an interactive video retrieval tool SOMHunter that won Video Browser Showdown 2020. The presented enhancements focus on improving text querying capabilities since the text search model plays a crucial part in successful searches. Hence, we introduce the ability to specify multiple text queries with further positional specification so users can better describe positional relationships of the objects. Moreover, a possibility to further specify text queries with an example image is introduced as well as consequent changes to the user interface of the tool.
Chapter
This paper presents a new version of the Interactive VIdeo Search Tool (IVIST), a video retrieval tool, for participation in the Video Browser Showdown (VBS) 2021. The previous IVIST (VBS 2020) offered core functions for practical video search, such as object detection, scene-text recognition, and dominant-color finding. In addition to these core functions, we supplement further helpful functions to make finding videos more effective: action recognition, place recognition, and description searching methods. These features are expected to enable a more detailed search, especially for human motion and background descriptions, which could not be covered by the previous IVIST system. Furthermore, the user interface has been enhanced in a more user-friendly way. With these enhanced functions, the new version of IVIST can be practical and widely used by actual users.
Chapter
The BoW variant of the W2VV++ model integrated into the VIRET and SOMHunter systems proved its effectiveness in the previous Video Browser Showdown competition in 2020. As the next experimental interactive search prototype to benchmark, we consider a simple system relying on the more complex BERT variant of the W2VV++ model, accepting rich text input. The input can be provided by keyboard or by speech processed by a third-party cloud service. The motivation for the more complex BERT variant is its good performance for rich text descriptions that can be provided for known-item search tasks. At the same time, users will be instructed to provide as rich a text description of the searched scene as possible.
Chapter
Nowadays, popular web search portals enable users to find available images corresponding to a provided free-form text description. With such sources of example images, a suitable composition/collage of images can be constructed as an appropriate visual query input to a known-item search system. In this paper, we investigate a querying approach enabling users to search videos with a multi-query consisting of positioned example images, so-called collage query, depicting expected objects in a searched scene. The approach relies on images from external search engines, partitioning of preselected representative video frames, relevance scoring based on deep features extracted from images/frames, and is currently integrated into the open-source version of the SOMHunter system providing additional browsing capabilities.
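One plausible reading of the collage-query scoring described above (positioned example images matched against partitions of representative frames via deep features) is sketched below: each example is compared with the feature of the grid cell it was placed on, and the similarities are averaged. The grid size, the use of cosine similarity, and mean aggregation are assumptions, not the SOMHunter implementation.

```python
import numpy as np

def collage_score(frame_region_feats, collage):
    """Score one frame against a collage query.

    frame_region_feats : dict (row, col) -> L2-normalised feature of that grid cell
    collage            : list of (row, col, feature) positioned example images
    The score is the mean cosine similarity between each example and the
    feature of the cell the user placed it on.
    """
    sims = [float(frame_region_feats[(r, c)] @ f) for r, c, f in collage]
    return sum(sims) / len(sims)

rng = np.random.default_rng(1)
def unit(v): return v / np.linalg.norm(v)
cells = {(r, c): unit(rng.normal(size=64)) for r in range(3) for c in range(3)}
collage = [(0, 0, cells[(0, 0)]),          # example matching the top-left cell
           (2, 2, unit(rng.normal(size=64)))]
print(round(collage_score(cells, collage), 3))
```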
Conference Paper
In this paper, we present the iteration of the multimedia retrieval system vitrivr participating at LSC 2022. vitrivr is a general-purpose retrieval system which has previously participated at LSC. We describe the system architecture and functionality, and show initial results based on the test and validation topics.
Chapter
In the ongoing multimedia age, search needs become more variable and challenging to aid. In the area of content-based similarity search, asking search engines for one or just a few nearest neighbours to a query may not be sufficient to accomplish a challenging search task. In this work, we investigate a task type where users search for one particular multimedia object in a large database. The complexity of the task is empirically demonstrated with a set of experiments, and the need for a larger number of nearest neighbours is discussed. A baseline approach for finding a larger number of approximate nearest neighbours is tested, showing a potential speed-up with respect to a naive sequential scan. Last but not least, an open efficiency challenge for metric access methods is discussed for the datasets used in the experiments.
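The naive sequential scan referred to above as the reference point for retrieving a larger number of nearest neighbours can be written in a few lines; the sketch below is the generic exact k-NN scan (Euclidean distance), included only to make that baseline concrete.

```python
import numpy as np

def knn_sequential_scan(query, database, k=100):
    """Exact k-NN by a naive sequential scan (the baseline to beat).

    query : (D,) vector, database : (N, D) matrix; Euclidean distance.
    Returns indices of the k closest database objects, nearest first.
    """
    dists = np.linalg.norm(database - query, axis=1)   # one pass over all N objects
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(100_000, 64))
print(knn_sequential_scan(rng.normal(size=64), db, k=5))
```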
Article
This article conducts a user evaluation to study the performance difference between interactive and automatic search. Particularly, the study aims to provide empirical insights into how the performance landscape of video search changes when tens of thousands of concept detectors are freely available to exploit for query formulation. We compare three types of search modes: free-to-play (i.e., search from scratch), non-free-to-play (i.e., search by inspecting results provided by automatic search), and automatic search including concept-free and concept-based retrieval paradigms. The study involves a total of 40 participants; each performs interactive search over 15 queries of various difficulty levels using two search modes on the IACC.3 dataset provided by the TRECVid organizers. The study suggests that the performance of automatic search is still far behind interactive search. Furthermore, providing users with the result of automatic search for exploration does not show an obvious advantage over asking users to search from scratch. The study also analyzes user behavior to reveal insights into how users compose queries, browse results, and discover new query terms for search, which can serve as a guideline for future research on both interactive and automatic search.
Conference Paper
Full-text available
We present an image search engine that allows searching by similarity among the roughly 100M images included in the YFCC100M dataset, and annotating query images. Image similarity search is performed using YFCC100M-HNfc6, the set of deep features we extracted from the YFCC100M dataset, which was indexed using the MI-File index for efficient similarity searching. A metadata cleaning algorithm, which uses visual and textual analysis, was used to select from the YFCC100M dataset a relevant subset of images and associated annotations, to create a training set for automatic textual annotation of submitted queries. The online image search and annotation system demonstrates the effectiveness of the deep features for assessing conceptual similarity among images, the effectiveness of the metadata cleaning algorithm in identifying a relevant training set for annotation, and the efficiency and accuracy of the MI-File similarity index techniques for searching and annotating with a dataset of 100M images, using very limited computing resources.
Conference Paper
Full-text available
We present a new approach to visually browse very large sets of untagged images. High-quality image features are generated using transformed activations of a convolutional neural network. These features are used to model image similarities, from which a hierarchical image graph is built. We show how such a graph can be constructed efficiently. In our experiments we found that the best user experience for navigating the graph is achieved by projecting sub-graphs onto a regular 2D image map. This allows users to explore the image collection like an interactive map.
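The first step described above, modelling image similarities from CNN activations before building the hierarchical graph, can be illustrated by a plain k-nearest-neighbour similarity graph over L2-normalised features. This is a generic sketch, not the authors' construction, which additionally builds a hierarchy and 2D projections.

```python
import numpy as np

def knn_graph(features, k=10):
    """Build a k-nearest-neighbour similarity graph over image features.

    features : (N, D) CNN activations, L2-normalised so the dot product
    is the cosine similarity. Returns adjacency as {i: [(j, sim), ...]}.
    """
    sims = features @ features.T
    np.fill_diagonal(sims, -np.inf)        # no self-edges
    graph = {}
    for i in range(len(features)):
        nbrs = np.argsort(sims[i])[::-1][:k]
        graph[i] = [(int(j), float(sims[i, j])) for j in nbrs]
    return graph

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 128))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
g = knn_graph(feats, k=5)
print(g[0])
```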
Conference Paper
Full-text available
This extended demo paper summarizes our interface used for the Video Browser Showdown (VBS) 2017 competition, where visual and textual known-item search (KIS) tasks, as well as ad-hoc video search (AVS) tasks in a 600-h video archive need to be solved interactively. To this end, we propose a very flexible distributed video search system that combines many ideas of related work in a novel and collaborative way, such that several users can work together and explore the video archive in a complementary manner. The main interface is a perspective Feature Map, which shows keyframes of shots arranged according to a selected content similarity feature (e.g., color, motion, semantic concepts, etc.). This Feature Map is accompanied by additional views, which allow users to search and filter according to a particular content feature. For collaboration of several users we provide a cooperative heatmap that shows a synchronized view of inspection actions of all users. Moreover, we use collaborative re-ranking of shots (in specific views) based on retrieved results of other users.
Conference Paper
Full-text available
vitrivr is an open source full-stack content-based multimedia retrieval system with focus on video. Unlike the majority of the existing multimedia search solutions, vitrivr is not limited to searching in metadata, but also provides content-based search and thus offers a large variety of different query modes which can be seamlessly combined: Query by sketch, which allows the user to draw a sketch of a query image and/or sketch motion paths, Query by example, keyword search, and relevance feedback. The vitrivr architecture is self-contained and addresses all aspects of multimedia search, from offline feature extraction, database management to frontend user interaction. The system is composed of three modules: a web-based frontend which allows the user to input the query (e.g., add a sketch) and browse the retrieved results (vitrivr-ui), a database system designed for interactive search in large-scale multimedia collections (ADAM), and a retrieval engine that handles feature extraction and feature-based retrieval (Cineast). The vitrivr source is available on GitHub under the MIT open source (and similar) licenses and is currently undergoing several upgrades as part of the Google Summer of Code 2016.
Article
Full-text available
Interactive video retrieval tools developed over the past few years are emerging as powerful alternatives to automatic retrieval approaches by giving the user more control as well as more responsibilities. Current research tries to identify the best combinations of image, audio and text features that, combined with innovative UI design, maximize the tools' performance. We present the last installment of the Video Browser Showdown 2015, which was held in conjunction with the International Conference on MultiMedia Modeling 2015 (MMM 2015) and has the stated aim of pushing for a better integration of the user into the search process. The setup of the competition, including the dataset used and the presented tasks, as well as the participating tools, will be introduced. The performance of those tools will be thoroughly presented and analyzed. Interesting highlights will be marked and some predictions regarding the research focus within the field for the near future will be made.
Article
Full-text available
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research.
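The training objective stated in the abstract (maximising the likelihood of the target description given the image) corresponds to the standard factorised log-likelihood written below as a reference formula.

```latex
% Maximum-likelihood training objective of the captioning model:
% the log-probability of sentence S = (s_1, ..., s_T) given image I,
% factorised over words with the chain rule.
\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p_{\theta}(S \mid I)
           = \arg\max_{\theta} \sum_{(I,S)} \sum_{t=1}^{T}
             \log p_{\theta}\big(s_t \mid I, s_1, \ldots, s_{t-1}\big)
```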
Article
Full-text available
Digital video enables manifold ways of multimedia content interaction. Over the last decade, many proposals for improving and enhancing video content interaction were published. More recent work particularly leverages highly capable devices such as smartphones and tablets that embrace novel interaction paradigms, e.g. touch, gesture-based or physical content interaction. In this paper, we survey literature at the intersection of Human-Computer Interaction and Multimedia. We integrate literature from video browsing and navigation, direct video manipulation, video content visualization, as well as interactive video summarization and interactive video retrieval. We classify the reviewed works by the underlying interaction method and discuss the achieved improvements so far. We also depict a set of open problems that the video interaction community should address in the next years.
Article
Full-text available
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
Article
Full-text available
The Video Browser Showdown is an international competition in the field of interactive video search and retrieval. It is held annually as a special session at the International Conference on Multimedia Modeling (MMM). The Video Browser Showdown evaluates the performance of exploratory tools for interactive content search in videos in direct competition and in front of an audience. Its goal is to push research on user-centric video search tools including video navigation, content browsing, content interaction, and video content visualization. This article summarizes the first three VBS competitions (2012-2014).
Article
Full-text available
What started as a field with an emphasis on optimally serving users' interactive information needs has now become dominated by methods that focus on improving the mean average precision (MAP) of a clearly defined task disconnected from its application. With the pervasiveness of the Internet and all the sensors available to derive contextual user information, it is time to bring the data and the user back together. As a field, we must consider understanding the subjective and descriptive nature of users and understanding data as equally interesting research topics that are both worthy of publication. At the 2012 ACM Second Annual International Conference on Multimedia Retrieval (ICMR) in Hong Kong, a panel took place with Marcel Worring as the moderator and the other authors of this article as the panelists. This panel discussion explored this intriguing question: Where is the user in multimedia retrieval?
Article
Full-text available
We present a comprehensive review of the state of the art in video browsing and retrieval systems, with special emphasis on interfaces and applications. There has been a significant increase in activity (e.g., storage, retrieval, and sharing) employing video data in the past decade, both for personal and professional use. The ever-growing amount of video content available for human consumption and the inherent characteristics of video data (which, if presented in its raw format, is rather unwieldy and costly) have become driving forces for the development of more effective solutions to present video contents and allow rich user interaction. As a result, there are many contemporary research efforts toward developing better video browsing solutions, which we summarize. We review more than 40 different video browsing and retrieval interfaces and classify them into three groups: applications that use video-player-like interaction, video retrieval applications, and browsing solutions based on video surrogates. For each category, we present a summary of existing work, highlight the technical aspects of each solution, and compare them against each other.
Article
Full-text available
Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design, our approach simultaneously exploits each sample's local structure, along with sample relevance, density, and diversity information, and makes use of both labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least square model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account sample relevance, density and diversity information, as well as sample efficacy in minimizing the parameter variance of the proposed local learning regularized least square model. We compare the performance between our approach and the state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark. We report superior performance from the proposed approach.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
Extending beyond the boundaries of science, art, and culture, content-based multimedia information retrieval provides new paradigms and methods for searching through the myriad variety of media over the world. This survey reviews 100+ recent articles on content-based multimedia information retrieval and discusses their role in current research directions which include browsing and search paradigms, user studies, affective computing, learning, semantic queries, new features and media types, high performance indexing, and evaluation techniques. Based on the current state of the art, we discuss the major challenges for the future.
Article
Full-text available
We propose the time interval multimedia event (TIME) framework as a robust approach for classification of semantic events in multimodal video documents. The representation used in TIME extends the Allen temporal interval relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimodal video analysis. To demonstrate the viability of our approach, it was evaluated on the domains of soccer and news broadcasts. For automatic classification of semantic events, we compare three different machine learning techniques, i.e., C4.5 decision tree, maximum entropy, and support vector machine. The results show that semantic video indexing results significantly benefit from using the TIME framework.
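For reference, the basic Allen interval relations that TIME extends can be classified with a few comparisons. The Python sketch below, with illustrative names only, covers the seven basic relations and leaves the six inverses to argument swapping; it does not reproduce the framework's multimodal extensions.

```python
def allen_relation(a, b):
    """Classify the basic Allen relation of interval a = (start, end) with respect to b.

    Illustrative sketch only: the six inverse relations are obtained by swapping
    the arguments, and the TIME framework's extensions are not reproduced here.
    """
    a_s, a_e = a
    b_s, b_e = b
    if a_e < b_s:
        return "before"
    if a_e == b_s:
        return "meets"
    if a_s < b_s < a_e < b_e:
        return "overlaps"
    if a_s == b_s and a_e < b_e:
        return "starts"
    if b_s < a_s and a_e < b_e:
        return "during"
    if b_s < a_s and a_e == b_e:
        return "finishes"
    if a_s == b_s and a_e == b_e:
        return "equals"
    return "inverse"  # b stands in one of the relations above to a

# Example: a goal event interval overlapping a crowd-cheering interval.
print(allen_relation((10.0, 15.0), (12.0, 20.0)))  # -> "overlaps"
```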
Chapter
This chapter describes different image sorting algorithms. It introduces a new measure, which is better suited to evaluate two‐dimensional (2D) image arrangements. The chapter presents a modified self‐sorting maps (SSM) algorithm with improved sorting quality and reduced complexity, which allows millions of images to be sorted very quickly. Graph‐based approaches can handle changes in the image collection. The chapter also presents a new graph‐based approach to visually browse very large sets of varying images. It shows how high‐quality image features representing the image content can be generated using transformed activations of a convolutional neural network. The chapter then presents an overview of various visual browsing models for image exploration. A self‐organizing map (SOM) is an artificial neural network that is trained using unsupervised learning to produce a lower dimensional, discrete representation of the input space, called a map. A SOM consists of components called nodes.
Conference Paper
Known-item search in multimodal lifelog data represents a challenging task for present search engines. Since sequences of temporally close images represent a significant part of the provided data, an interactive video retrieval tool with a few extensions can compete in the known-item search tasks of the Lifelog Search Challenge. We present an update of the SIRET interactive video retrieval tool that recently won the Video Browser Showdown 2018. As the tool relies on frame-based representations and retrieval models, it can be directly used also for images from lifelog cameras. The updates mostly comprise visualization and navigation methods for the high number of visually similar scenes representing repetitive daily activities.
Article
The last decade has seen innovations that make video recording, manipulation, storage and sharing easier than ever before, thus impacting many areas of life. New video retrieval scenarios emerged as well, which challenge the state-of-the-art video retrieval approaches. Despite recent advances in content analysis, video retrieval can still benefit from involving the human user in the loop. We present our experience with a class of interactive video retrieval scenarios and our methodology to stimulate the evolution of new interactive video retrieval approaches. More specifically, the Video Browser Showdown evaluation campaign is thoroughly analyzed, focusing on the years 2015-2017. Evaluation scenarios, objectives and metrics are presented, complemented by the results of the annual evaluations. The results reveal promising interactive video retrieval techniques adopted by the most successful tools and confirm assumptions about the different complexity of various types of interactive retrieval scenarios. A comparison of the interactive retrieval tools with automatic approaches (including fully automatic and manual query formulation) participating in the TRECVID 2016 Ad-hoc Video Search (AVS) task is discussed. Finally, based on the results of data analysis, a substantial revision of the evaluation methodology for the following years of the Video Browser Showdown is provided.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using the state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines, that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high-coverage and high-diversity of exemplars, the Places Database along with the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.
Article
The Benchmarking Initiative for Multimedia Evaluation (MediaEval) organizes an annual cycle of scientific evaluation tasks in the area of multimedia access and retrieval. The tasks offer scientific challenges to researchers working in diverse areas of multimedia technology. The tasks, which are focused on the social and human aspects of multimedia, help the research community tackle challenges linked to less widely studied user needs. They also support researchers in investigating the diversity of perspectives that naturally arise when users interact with multimedia content. Here, the authors present highlights from the 2016 workshop.
Conference Paper
Our multimedia event detection system, which was successful at TRECVID 2015, showed its strength in handling complex concepts in a query. The system was based on a large number of pre-trained concept detectors for textual-to-visual relation. In this paper, we enhance the system by enabling a human in the loop. In order to help a user quickly find an information need, we incorporate concept screening, video reranking by highlighted concepts, relevance feedback, and color sketching to refine a coarse retrieval result. The aim is to eventually arrive at a system suitable for both Ad-hoc Video Search and Known-Item Search. In addition, given the increasing awareness of how difficult it is to distinguish shots of very similar scenes, we also explore automatic story annotation along the timeline of a video, so that a user can quickly grasp the story unfolding in the context of a target shot and reject shots with incorrect context. With the story annotation, a user can also refine the search result by simply adding a few keywords in a special "context field" of the query.
Conference Paper
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
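A minimal PyTorch sketch of such a pre-activation residual unit is given below; the class name, fixed channel count, and 3x3 kernels are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActResidualUnit(nn.Module):
    """Pre-activation residual unit: BN -> ReLU -> conv, applied twice, plus an identity skip."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out  # identity mapping as the skip connection, no after-addition activation

# Example: a random feature map passes through with its shape unchanged.
unit = PreActResidualUnit(64)
y = unit(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```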
Conference Paper
We present a novel approach to browsing huge sets of video scenes using a hierarchical graph and visually sorted image maps, allowing the user to explore the graph in a manner similar to navigation services. In a previous paper [1] we proposed a scheme to generate such a graph of video scenes and investigated several browsing and visualization concepts. In this paper we extend our work by adding semantic features learned by a convolutional neural network. In combination with visual features, we construct an improved graph in which related images (video scenes) are connected with each other. Different images or areas in the graph may be reached by following the most promising path of edges. For efficient navigation, we propose a method that projects images onto a 2D plane while preserving their complex inter-image relationships. To start a search process, the user may either choose from a selection of typical video scenes or use tools such as search by sketch or category. The retrieved video frames are arranged on a canvas, and the view of the graph is directed to a location where matching frames can be found.
Article
For supporting retrieval tasks within large multimedia collections, not only the sheer size of data but also the complexity of data and their associated metadata pose a challenge. Applications that have to deal with big multimedia collections need to manage the volume of data and to effectively and efficiently search within these data. When providing similarity search, a multimedia retrieval system has to consider the actual multimedia content, the corresponding structured metadata (e.g., content author, creation date, etc.) and—for providing similarity queries—the extracted low-level features stored as densely populated high-dimensional feature vectors. In this paper, we present ADAMpro, a combined database and information retrieval system that is particularly tailored to big multimedia collections. ADAMpro follows a modular architecture for storing structured metadata, as well as the extracted feature vectors and it provides various index structures, i.e., Locality-Sensitive Hashing, Spectral Hashing, and the VA-File, for a fast retrieval in the context of a similarity search. Since similarity queries are often long-running, ADAMpro supports progressive queries that provide the user with streaming result lists by returning (possibly imprecise) results as soon as they become available. We provide the results of an evaluation of ADAMpro on the basis of several collection sizes up to 50 million entries and feature vectors with different numbers of dimensions.
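As an illustration of one of the index structures named above, the sketch below implements generic random-hyperplane Locality-Sensitive Hashing in NumPy; it is not ADAMpro's implementation, and all function and parameter names are hypothetical.

```python
import numpy as np

def make_planes(dim, n_bits=32, seed=0):
    """Random hyperplanes shared by the indexed collection and all queries."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim, n_bits))

def lsh_signature(vectors, planes):
    """Random-hyperplane LSH: similar vectors tend to agree on many sign bits."""
    return np.atleast_2d(vectors) @ planes > 0  # boolean signatures, shape (n, n_bits)

def candidates(query_vec, collection_sig, planes, max_hamming=4):
    """Return indices whose signatures are within max_hamming bits of the query's."""
    q_sig = lsh_signature(query_vec, planes)[0]
    dist = (collection_sig != q_sig).sum(axis=1)
    return np.where(dist <= max_hamming)[0]

# Example with a toy collection of 1000 feature vectors of dimension 128.
rng = np.random.default_rng(1)
feats = rng.standard_normal((1000, 128))
planes = make_planes(128)
sigs = lsh_signature(feats, planes)
print(candidates(feats[42], sigs, planes)[:10])  # index 42 itself is among the candidates
```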
Article
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description tasks jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a convolutional network, a novel dense localization layer, and a recurrent neural network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.
Conference Paper
We present a graph-based browsing system for visually searching video clips in large collections. It is an extension of a previously proposed system ImageMap which allows visual browsing in millions of images using a hierarchical pyramid structure of images sorted by their similarities. Image subsets can be explored through a viewport at different pyramid levels, however, due to the underlying 2D-organization the high dimensional relationships between all images could not be represented. In order to preserve the complex inter-image relationships we propose to use a hierarchical graph where edges connect related images. By traversing this graph the users may navigate to other similar images. Different visualization and navigation modes are available. Various filters and search tools such as search by example, color, or sketch may be applied. These tools help to narrow down the amount of video frames to be inspected or to direct the view to regions of the graph where matching frames are located.
Article
Despite the tremendous importance and availability of large video collections, support for video retrieval is still rather limited and is mostly tailored to very concrete use cases and collections. In image retrieval, for instance, standard keyword search on the basis of manual annotations and content-based image retrieval, based on the similarity to query image(s), are well established search paradigms, both in academic prototypes and in commercial search engines. Recently, with the proliferation of sketch-enabled devices, also sketch-based retrieval has received considerable attention. The latter two approaches are based on intrinsic image features and rely on the representation of the objects of a collection in the feature space. In this paper, we present Cineast, a multi-feature sketch-based video retrieval engine. The main objective of Cineast is to enable a smooth transition from content-based image retrieval to content-based video retrieval and to support powerful search paradigms in large video collections on the basis of user-provided sketches as query input. Cineast is capable of retrieving video sequences based on edge or color sketches as query input and even supports one or multiple exemplary video sequences as query input. Moreover, Cineast also supports a novel approach to sketch-based motion queries by allowing a user to specify the motion of objects within a video sequence by means of (partial) flow fields, also specified via sketches. Using an emergent combination of multiple different features, Cineast is able to universally retrieve video (sequences) without the need for prior knowledge or semantic understanding. The evaluation with a general purpose video collection has shown the effectiveness and the efficiency of the Cineast approach.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
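A minimal NumPy sketch of the training-time normalization step reads as follows; the learned scale gamma and shift beta are passed in, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift.

    x: array of shape (batch, features); gamma, beta: arrays of shape (features,).
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Example: after normalization each feature has (approximately) zero mean and unit variance.
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 10))
y = batch_norm_train(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(6), y.var(axis=0).round(3))
```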
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
An overview of the self-organizing map algorithm, on which the papers in this issue are based, is presented in this article.
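A compact NumPy sketch of the basic SOM training loop is shown below; grid size, decay schedules, and all names are illustrative choices rather than part of any particular published configuration.

```python
import numpy as np

def train_som(data, grid_h=10, grid_w=10, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a self-organizing map on row vectors in `data` (shape: n_samples x dim)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((grid_h, grid_w, dim))
    # 2D grid coordinates, used for the neighbourhood function around the winning node
    coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * np.exp(-step / n_steps)        # decaying learning rate
            sigma = sigma0 * np.exp(-step / n_steps)  # shrinking neighbourhood radius
            # best-matching unit: the node whose weight vector is closest to the input
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood on the grid pulls nearby nodes towards the input
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

# Example: map 500 random 64-dimensional feature vectors onto a 10x10 grid.
som = train_som(np.random.default_rng(1).random((500, 64)))
print(som.shape)  # (10, 10, 64)
```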
Article
Usability evaluation is an increasingly important part of the user interface design process. However, usability evaluation can be expensive in terms of time and human resources, and automation is therefore a promising way to augment existing approaches. This article presents an extensive survey of usability evaluation methods, organized according to a new taxonomy that emphasizes the role of automation. The survey analyzes existing techniques, identifies which aspects of usability evaluation automation are likely to be of use in future research, and suggests new ways to expand existing approaches to better support usability evaluation.
Conference Paper
The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.
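The paper describes the engine itself; a common way to invoke it from Python is the pytesseract wrapper, as sketched below (the image file name is a placeholder, and the native Tesseract binary must be installed separately).

```python
from PIL import Image
import pytesseract  # thin wrapper around the Tesseract OCR engine

# Run the full pipeline (line finding, classification, adaptive classifier) on one image.
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)
```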
Article
In this paper, a subspace-based multimedia data mining framework is proposed for video semantic analysis, specifically video event/concept detection, by addressing two basic issues, i.e., semantic gap and rare event/concept detection. The proposed framework achieves full automation via multimodal content analysis and intelligent integration of distance-based and rule-based data mining techniques. The content analysis process facilitates the comprehensive video analysis by extracting low-level and middle-level features from audio/visual channels. The integrated data mining techniques effectively address these two basic issues by alleviating the class imbalance issue along the process and by reconstructing and refining the feature dimension automatically. The promising experimental performance on goal/corner event detection and sports/commercials/building concepts extraction from soccer videos and TRECVID news collections demonstrates the effectiveness of the proposed framework. Furthermore, its unique domain-free characteristic indicates the great potential of extending the proposed multimedia data mining framework to a wide range of different application domains.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
  • Shaoqing Ren
  • Kaiming He
  • Ross Girshick
  • Jian Sun
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Neural Information Processing Systems (NIPS).
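For readers who want to try a Faster R-CNN detector without the original code, torchvision ships a re-implementation; the sketch below uses it with a dummy input (the weights argument shown requires a reasonably recent torchvision version).

```python
import torch
import torchvision

# Pretrained two-stage detector (region proposal network + detection head) from torchvision.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]  # one dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model(images)

# Each prediction holds bounding boxes, class labels, and confidence scores.
print(predictions[0]["boxes"].shape, predictions[0]["labels"][:5], predictions[0]["scores"][:5])
```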
Mastering ElasticSearch
  • Rafal Kuc
  • Marek Rogozinski
Rafal Kuc and Marek Rogozinski. 2013. Mastering ElasticSearch. Packt Publishing.
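To give a flavour of the query DSL covered by the book, the sketch below sends a simple full-text match query over HTTP; the index name, field name, and endpoint are hypothetical and assume a locally running Elasticsearch instance.

```python
import requests

# Hypothetical index of video segments with a free-text "caption" field.
query = {"query": {"match": {"caption": "red car crossing a bridge"}}, "size": 10}
resp = requests.post("http://localhost:9200/video_segments/_search", json=query)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("caption"))
```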
Enhanced VIREO KIS at VBS 2018
  • Phuong Anh Nguyen
  • Yi-Jie Lu
  • Hao Zhang
  • Chong-Wah Ngo