Bag-of-visual-words vs global image descriptors on two-stage multimodal retrieval
DOI: 10.1145/2009916.2010144 Conference: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011
The Bag-Of-Visual-Words (BOVW) paradigm is fast becoming a popular image representation for Content-Based Image Retrieval (CBIR), mainly because of its better retrieval effectiveness over global feature representations on collections whose images are near-duplicates of the queries. In this experimental study we demonstrate that this advantage of BOVW is diminished when visual diversity is enhanced by using a secondary modality, such as text, to pre-filter images. The TOP-SURF descriptor is evaluated against Compact Composite Descriptors on a two-stage image retrieval setup, which first uses a text modality to rank the collection and then performs CBIR only on the top-K items.
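The two-stage setup described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the toy term-overlap text score, the L1 visual distance, and all function and field names are illustrative assumptions standing in for a real IR model and real image descriptors.

```python
def text_score(query_terms, doc_terms):
    # Toy text relevance: term overlap (a stand-in for a real IR model).
    return len(set(query_terms) & set(doc_terms))

def visual_distance(query_desc, doc_desc):
    # Toy visual distance: L1 distance between global descriptors.
    return sum(abs(a - b) for a, b in zip(query_desc, doc_desc))

def two_stage_retrieval(query_terms, query_desc, collection, k=3):
    # Stage 1: rank the whole collection by the text modality.
    ranked = sorted(collection,
                    key=lambda d: text_score(query_terms, d["terms"]),
                    reverse=True)
    # Stage 2: run CBIR only on the top-K text results.
    top_k = ranked[:k]
    return sorted(top_k, key=lambda d: visual_distance(query_desc, d["desc"]))

collection = [
    {"id": 1, "terms": ["cat"], "desc": [0.9, 0.1]},
    {"id": 2, "terms": ["cat", "pet"], "desc": [0.2, 0.8]},
    {"id": 3, "terms": ["dog"], "desc": [0.5, 0.5]},
]
result = two_stage_retrieval(["cat", "pet"], [0.2, 0.7], collection, k=2)
print([d["id"] for d in result])  # → [2, 1]
```

The point of the pre-filter is that stage 2 only ever compares the query against the K text-relevant candidates, which reduces the visual near-duplicate bias the study discusses.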
ABSTRACT: Due to the rapid development of information technology and the continuously increasing number of available multimedia data, the task of retrieving information based on visual content has become a popular subject of scientific interest. Recent approaches adopt the bag-of-visual-words (BOVW) model to retrieve images in a semantic way. BOVW has shown remarkable performance in content-based image retrieval tasks, exhibiting better retrieval effectiveness than global and local feature (LF) representations. The performance of the BOVW approach depends strongly, however, on predicting the ideal codebook size, a difficult and database-dependent task. The contribution of this paper is threefold. First, it presents a new technique that uses a self-growing and self-organized neural gas network to calculate the most appropriate size of a codebook for a given database. Second, it proposes a new soft-weighting technique, whereby each LF is classified into only one visual word (VW) with a degree of participation. Third, by combining the information derived from the method that automatically detects the number of VWs, the soft-weighting method, and a color information extraction method from the literature, it shapes a new descriptor, called color VWs. Experimental results on two well-known benchmarking databases demonstrate that the proposed descriptor outperforms 15 contemporary descriptors and methods from the literature, in terms of both precision at K and its ability to retrieve the entire ground truth.
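The soft-weighting idea in the abstract above — each local feature votes for exactly one (nearest) visual word, but with a degree of participation rather than a hard count — can be sketched as below. The Gaussian weighting kernel and the sigma parameter are illustrative assumptions; the paper's actual participation function may differ.

```python
import math

def nearest_vw(feature, codebook):
    # Return the index of the nearest visual word and its Euclidean distance.
    best_idx, best_dist = 0, float("inf")
    for i, word in enumerate(codebook):
        d = math.dist(feature, word)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist

def soft_histogram(features, codebook, sigma=1.0):
    # Each local feature votes for a single VW, weighted by a degree of
    # participation that decays with distance (illustrative Gaussian kernel).
    hist = [0.0] * len(codebook)
    for f in features:
        idx, d = nearest_vw(f, codebook)
        hist[idx] += math.exp(-d * d / (2 * sigma * sigma))
    return hist

codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]
print(soft_histogram(features, codebook))
```

Compared with hard assignment (each feature adds exactly 1 to one bin), a feature lying far from every codeword contributes less to the histogram, which is the intent of the soft-weighting scheme described above.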
ABSTRACT: Mobile devices such as smartphones and tablets are widely used in everyday life to perform a variety of operations, such as e-mail exchange, connection to social media, and bank/financial transactions. Moreover, because of the large growth of multimedia applications, transferring and sharing video and images over wireless networks is becoming increasingly popular. Several modern mobile applications perform information retrieval and image recognition. For example, Google Goggles is an image recognition application used for searches based on pictures taken by handheld devices. In most cases, the image recognition procedure is effectively an image retrieval procedure: the captured images, or low-level descriptions of them, are uploaded online, and the system recognizes their content by retrieving visually similar pictures. With this in mind, our goal in this paper is to evaluate the process of image retrieval/recognition over an IEEE 802.11b network operating at 2.4 GHz. Our evaluation is performed on a simulated network configuration consisting of a number of mobile nodes communicating with an access point. Throughout our simulations, we examine the impact of several factors, such as the existence of a strong line of sight during communication between wireless devices; line-of-sight conditions depend on the fading model used in the simulations and affect the bit error rate (BER). We used a large number of image descriptors and a variety of scenarios reported in the literature in order to comprehensively evaluate our system. To reinforce our results, experiments were conducted on two well-known image databases using 10 descriptors from the literature. Copyright © 2014 John Wiley & Sons, Ltd.
ABSTRACT: In this paper, we focus on implementing the extraction of a well-known low-level image descriptor using the multicore power provided by general-purpose graphics processing units (GPGPUs). The color and edge directivity descriptor, which incorporates both color and texture information and achieves a successful trade-off between effectiveness and efficiency, is employed and reassessed for parallel execution. We are motivated by the fact that image/frame indexing should be achieved in real time, which in our case means that a system should be capable of indexing a frame or an image as it becomes part of a database (ideally, calculating the descriptor as the images are captured). Two strategies are explored to accelerate the method and bypass resource limitations and architectural constraints: an approach that exclusively uses the GPU, and a hybrid implementation that distributes the computations across the available GPU and CPU resources. The first approach is strongly based on the Compute Unified Device Architecture (CUDA) and excels over all other solutions when GPU resources are abundant. The second is a hybrid scheme in which the extraction process is split into two sequential stages, allowing the input data (images or video frames) to be pipelined through the central and graphics processing units. Experiments were conducted on four different GPU-CPU combinations in order to highlight the strengths and weaknesses of each implementation. Real-time indexing is achieved on all computational setups for both the GPU-only and the hybrid technique. An impressive 22-fold acceleration is recorded for the GPU-only method. The proposed hybrid implementation outperforms the GPU-only implementation and becomes the preferred solution when a low-cost setup (i.e., a more advanced CPU combined with a relatively weak GPU) is employed.
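The pipelining idea in the hybrid scheme above — splitting extraction into two sequential stages so that one processor works on frame i while the other works on frame i+1 — can be sketched with a plain two-stage producer/consumer pipeline. This is only a structural illustration under assumed stand-in stages, not the paper's CUDA implementation; the stage bodies here are trivial placeholders.

```python
import queue
import threading

def stage1(frame):
    # Stand-in for the first (e.g., CPU-side) part of descriptor extraction.
    return [x * 2 for x in frame]

def stage2(intermediate):
    # Stand-in for the second (e.g., GPU-side) part of descriptor extraction.
    return sum(intermediate)

def pipeline(frames):
    q = queue.Queue(maxsize=2)   # bounded buffer between the two stages
    results = []

    def producer():
        for frame in frames:
            q.put(stage1(frame))
        q.put(None)              # sentinel: no more frames

    def consumer():
        while (item := q.get()) is not None:
            results.append(stage2(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipeline([[1, 2], [3, 4]]))  # → [6, 14]
```

The bounded queue is what makes this a pipeline rather than a batch: stage 2 starts consuming as soon as the first intermediate result is ready, overlapping the two stages in time.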