Conference Paper

Multi-Modal Image Retrieval for Complex Queries using Small Codes


Abstract

We propose a unified framework for image retrieval capable of handling complex and descriptive queries of multiple modalities in a scalable manner. A novel aspect of our approach is that it supports query specification in terms of objects, attributes and spatial relationships, thereby allowing for substantially more complex and descriptive queries. We allow these complex queries to be specified in three different modalities - images, sketches and structured textual descriptions. Furthermore, we propose a unique multi-modal hashing algorithm capable of mapping queries of different modalities to the same binary representation, enabling efficient and scalable image retrieval based on multi-modal queries. Extensive experimental evaluation shows that our approach outperforms the state-of-the-art image retrieval and hashing techniques on the MSRC and SUN09 datasets by about 100%, while the performance on a dataset of 1M images, from Flickr, demonstrates its scalability.
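As an illustration of the core mechanism described in the abstract, the sketch below maps features from two different modalities into one shared Hamming space and compares the resulting binary codes. It uses random projections purely as stand-ins for the learned hash functions of the paper, so the dimensions, bit length and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: project features from different modalities into a shared
# low-dimensional space, then threshold to obtain comparable binary codes.
# The projection matrices here are random; the paper learns them so that
# corresponding items across modalities map to similar codes.
import numpy as np

rng = np.random.default_rng(0)
CODE_BITS = 64

def make_projection(feature_dim, n_bits=CODE_BITS):
    """One random projection per modality (stand-in for a learned mapping)."""
    return rng.standard_normal((feature_dim, n_bits))

def to_binary_code(features, projection):
    """Map a feature vector (or a batch) to a {0,1} code of n_bits."""
    return (np.atleast_2d(features) @ projection > 0).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Number of differing bits between two codes."""
    return int(np.count_nonzero(code_a != code_b))

# Example: an image-feature query and a text-feature database entry end up
# in the same 64-bit Hamming space and can be compared directly.
P_image, P_text = make_projection(512), make_projection(300)
image_query = rng.standard_normal(512)
text_item = rng.standard_normal(300)
d = hamming_distance(to_binary_code(image_query, P_image),
                     to_binary_code(text_item, P_text))
print(f"Hamming distance between cross-modal codes: {d}")
```

Once both modalities live in the same binary space, retrieval is a Hamming-distance ranking, which is what makes the approach scalable to millions of images.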


... We represent feature blocks by their proportion of overlap with the ROI when computing the similarity measure. We consider the problem of semantics-preserving distance metric learning based on a patch-alignment scheme [21,25,72,76]. Moreover, the researchers in [46,48] used geo-tagged datasets from Flickr and assumed diverse views of the same scene in each of these datasets. ...
... Previously, such applications deployed keyword-based search operations. Recent research on search engines has incorporated the capability to integrate content-based descriptors, derived from the pixel data of the pictures, into their retrieval algorithms [18,72,76]. This approach is known as content-based image retrieval (CBIR). ...
... By inspecting the query results, the user adjusts the controls and submits a new query, which may be based on the original image or one of the returned images. The user may also tag a retrieved image as a good or bad match, providing additional information to the retrieval algorithm; this process is known as relevance feedback [72,79]. Two problems with such interfaces have been noted: a lack of user control and a lack of information about the computer's view of the image. ...
Article
Full-text available
Information retrieval systems are receiving increased attention in the era of multimedia technologies such as image, video, audio and text files. The large number of images poses a challenge for computer systems to store and manage data effectively and efficiently. Retrieving the shape of different objects in an image also remains a difficult problem due to the distinct viewing angles of objects in a scene, and only a few studies have reported solutions to the problem of finding the relative locations of ROIs. In this paper, we propose three methods: 1. geolocation-based image retrieval (GLBIR), 2. an unsupervised feature technique based on principal component analysis (PCA), and 3. multiple-region-based image retrieval. The first proposed method (GLBIR) identifies the geolocation of an image using a visual-attention-based mechanism and its color layout descriptors. These features are extracted from the geolocation of the query image from the Flickr database. Our proposed model does not require full semantic understanding of image content; it uses visual metrics, for example proximity, color contrast, size and nearness to the image's boundaries, to locate the viewer's attention. We analyzed the results and compared them with state-of-the-art CBIR systems and the GLBIR technique. Our second method refines the retrieved images by exploiting and fusing an unsupervised feature technique using principal component analysis (PCA): visually similar images are clustered together, the retrieval process is analyzed, and outliers in the initially retrieved image set are removed by PCA. To evaluate our proposed approach, we used thousands of images downloaded from the Flickr and CIFAR-10 databases using the Flickr public API. Finally, we propose a system for region-based image retrieval. It provides a user interface allowing the user to designate the watershed ROI within an input image. During retrieval, the feature vectors of regions whose codes are homogeneous to a region of the input image are used for comparison. Standard datasets are used for the evaluation of the proposed approach. The experiments demonstrate the effectiveness of the proposed method, which achieves higher annotation performance, increases accuracy and reduces image retrieval time. We evaluated our proposed approach on image datasets from Flickr and CIFAR-10.
... The concepts in a text do not appear in isolation; rather, they interact with each other, providing semantics to the text. These interactions can help bridge the semantic gap in the image search process. (Figure 1: top four retrieved images from a popular search engine for the queries (a) "a boy walking with an old man and a dog in a park" and (b) "a baby with an apple lying in the bed".) A few attempts [12][13][14][15][16][17][31] have been made to tackle the problem of complex queries. Nie et al. [12] proposed an approach which exploits web sources, in addition to visual and textual information, for query-image relevance estimation. ...
... However, images are not always surrounded by sentences and are sometimes labeled only with a few tags. Siddiquie et al. [17] proposed a unified framework which mainly exploits the objects' spatial relationships for more descriptive queries. Recently, Chen et al. [31] used K-parser to parse the query and text associated with the images. ...
... Concept detectors are built by leveraging visual features, and the weights of each concept in a query are then learnt. Siddiquie et al. [17] proposed a framework which handles queries consisting of objects, attributes and spatial relationships. They proposed a unique multi-modal hashing algorithm capable of mapping queries of different modalities to the same binary semantic representation. ...
Conference Paper
With the rising prevalence of social media, coupled with the ease of sharing images, people with specific needs and applications such as known-item search, multimedia question answering, etc., have started searching for visual content expressed in terms of complex queries. A complex query consists of multiple concepts and their attributes, arranged to convey semantics. It is less effective to answer such queries by simply appending the search results gathered from individual concepts, or subsets of concepts, present in the query. In this paper, we propose to exploit the query constituents and the relationships among them. The proposed approach finds image-query relevance by integrating three models: the linguistic pattern-based textual model, the visual model, and the cross-modality model. We extract linguistic patterns from complex queries, gather their related crawled images, and assign relevance scores to images in the corpus. The relevance scores are then used to rank the images. We experiment on more than 140k images and compare precision-at-k scores with those of state-of-the-art image ranking methods for complex queries. Also, the ranking of images obtained by our approach outperforms that of a popular search engine.
... To this end, this work proposes a multimedia retrieval framework that first combines the multiple views of each modality and then fuses the multiple modalities to generate a single ranking. Motivated by the Partial Least Squares (PLS) approach [24], due to its effectiveness in multimodal hashing, we adopt PLS Regression to combine multiple views of the same modality, such as SIFT descriptors [18] and visual features based on Deep Convolutional Neural Networks [20] as far as low-level visual descriptors are concerned, and then employ a non-linear graph-based method for fusing the different modalities. The effectiveness of the proposed framework is compared to several baseline methods in unsupervised multimedia retrieval, such as weighted linear, non-linear, diffusion-based and advanced graph-based models; the evaluation is performed using three publicly available datasets. ...
... Some other methodologies are Partial Least Squares (PLS) and correlation matching. With respect to the former, a PLS-based framework, which maps queries from multiple modalities to points in a common linear subspace, is introduced in [24]. Regarding the latter, Rasiwasia et al. [21] utilize correlation matching between the textual and visual modalities of multimedia documents in the task of cross-modal document retrieval. ...
... The Partial Least Squares (PLS) model has been used in multimodal fusion [24] to efficiently combine two modalities. Given two matrices X and Y , PLS decomposes X and Y as follows: ...
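The snippet above is cut off before the decomposition itself. For reference, the standard PLS factorization it alludes to is commonly written (in generic notation that may differ from [24]) as

X = T P^T + E,    Y = U Q^T + F,

where T and U are the latent score matrices, P and Q the loading matrices, and E, F the residuals; the latent directions are chosen so that the covariance between corresponding score columns of T and U is maximal, which is what lets two modalities be compared in the shared latent space.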
Article
Full-text available
Heterogeneous sources of information, such as images, videos, text and metadata are often used to describe different or complementary views of the same multimedia object, especially in the online news domain and in large annotated image collections. The retrieval of multimedia objects, given a multimodal query, requires the combination of several sources of information in an efficient and scalable way. Towards this direction, we provide a novel unsupervised framework for multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual metadata, integrating non-linear graph-based fusion and Partial Least Squares Regression. The fusion strategy is based on the construction of a multimodal contextual similarity matrix and the non-linear combination of relevance scores from query-based similarity vectors. Our framework can employ more than two modalities and high-level information, without increase in memory complexity, when compared to state-of-the-art baseline methods. The experimental comparison is done in three public multimedia collections in the multimedia retrieval task. The results have shown that the proposed method outperforms the baseline methods, in terms of Mean Average Precision and Precision@20.
... Another approach that aims at efficient cross-modal retrieval is by using correlation matching [14] between the two modalities. In addition, a Partial Least Squares (PLS) based approach is followed in [15], in which different modalities of the data are mapped into a common latent space. This methodology is evaluated in the image retrieval task. ...
... f(q) = w_1 s_1 + w_2 s_2 + w_3 s_3 + w'_1 s'_1(q) + w'_2 s'_2(q) + w'_3 s'_3(q) (15)  Graph-based with non-linear fusion: ...
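To make the two fusion strategies mentioned in this snippet concrete, here is a minimal sketch of weighted linear fusion of per-modality similarity vectors followed by a graph-based (random-walk style) refinement. The weights, the restart parameter alpha and all function names are illustrative assumptions and do not reproduce the cited framework.

```python
# Minimal sketch: fuse per-modality query-to-item similarity vectors
# (e.g., visual, concept, textual) linearly, then refine the fused scores
# with a simple random-walk step over an item-similarity graph.
import numpy as np

def linear_fusion(sim_vectors, weights):
    """Weighted sum of per-modality similarity vectors (one score per item)."""
    return sum(w * s for w, s in zip(weights, sim_vectors))

def graph_refine(scores, item_affinity, alpha=0.85, iters=20):
    """Personalized-random-walk style smoothing of the fused scores."""
    # Row-normalize the item-to-item affinity matrix into transition probabilities.
    P = item_affinity / item_affinity.sum(axis=1, keepdims=True)
    p0 = scores / scores.sum()
    p = p0.copy()
    for _ in range(iters):
        p = alpha * P.T @ p + (1 - alpha) * p0
    return p

rng = np.random.default_rng(1)
n_items = 6
sims = [rng.random(n_items) for _ in range(3)]        # three modalities
fused = linear_fusion(sims, weights=[0.5, 0.3, 0.2])  # illustrative weights
A = rng.random((n_items, n_items)); A = (A + A.T) / 2  # symmetric affinities
print(np.argsort(-graph_refine(fused, A)))             # final ranking
```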
... Fusion is popular in the multimedia retrieval field as it can effectively complement information from different modalities (such as images, videos, and text) [45], [46], [47], [48], [49]. Roughly, the idea of fusion is the combination of multiple heterogeneous feature sets that are extracted from multiple modalities describing the same scene from different views. ...
... In this work, we address the object retrieval problem by mimicking the fusion employed in multimedia retrieval. Thus, a distinctive difference between our method and the aforementioned works [45], [46], [47], [48], [49] is that we integrate similarities of two representations derived from the same visual descriptors, while they combine several sources of information produced from different modalities. Our work bears some similarity to [49] in its use of random walks. ...
Article
Re-ranking is an essential step for accurate image retrieval, due to its well known power in performance improvement. Although numerous works have been proposed for re-ranking, many of them are only customized for a certain image representation model. In contrast to most existing techniques, we develop generalized re-ranking algorithms that are applicable to different kinds of image encodings in this paper. We first employ a quite successful theory of similarity propagation to reconstruct vectors of a query and its top ranked images, and subsequently get a re-ranked list by comparing the new image vectors. Furthermore, considering that the just mentioned strategy is directly compatible with query expansion, and thus in order to leverage advantages of this milestone, we then propose to integrate them into a unified framework for maximizing re-ranking benefits. Our re-ranking algorithms are memory and computation efficient, and experimental results on benchmark datasets demonstrate that they compare favorably with the state-of-the-art. Our code is available at https://github.com/MaJinWakeUp/rerank.
... The discriminating power of the CBMR system may increase as more information describing the media content is included in the metadata. Moreover, the multimodal CBMR system can capture users' subjective query concepts thoroughly, and hence achieve high search accuracy [23,29]. In this paper, we assume that complex similarity queries are composed of Boolean combinations of feature descriptions. ...
... The approach [29] handles complex and descriptive queries of multiple modalities and supports query specification using semantic constructs such as objects, attributes, and relationships. To handle multi-modal queries it uses a hashing algorithm to map queries of different modalities to the same binary representation and uses the Hamming distance as the dissimilarity function in the binary space. ...
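Once queries and database images share a binary representation, as described above, retrieval reduces to Hamming-distance ranking, which can be done with XOR and bit counting over packed codes. A minimal sketch, assuming 64-bit codes and random data (not tied to the hashing method of [29]):

```python
# Minimal sketch: rank database items by Hamming distance to a query code,
# with codes packed into 64-bit integers so XOR + popcount does the work.
import numpy as np

def pack_bits(code_01):
    """Pack a {0,1} array whose length is a multiple of 64 into uint64 words."""
    return np.packbits(code_01.astype(np.uint8)).view(np.uint64)

def hamming_rank(query_words, db_words):
    """Return database indices sorted by Hamming distance to the query."""
    xor = np.bitwise_xor(db_words, query_words)            # (n_items, n_words)
    dists = np.unpackbits(xor.view(np.uint8), axis=1).sum(axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(2)
db_codes = rng.integers(0, 2, size=(1000, 64))
query_code = rng.integers(0, 2, size=64)
db_words = np.vstack([pack_bits(c) for c in db_codes])
print(hamming_rank(pack_bits(query_code), db_words)[:5])   # top-5 nearest codes
```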
Article
Full-text available
This paper proposes a novel indexing method for complex similarity queries in high-dimensional image and video systems. In order to provide the indexing method with the flexibility in dealing with multiple features and multiple query objects, we treat every dimension independently. The efficiency of our method is realized by a specialized bitmap indexing that represents all objects in a database as a set of bitmaps. The percentage of data accessed in our indexing method is inversely proportional to the overall dimensionality, and thus the performance deterioration with the increasing dimensionality does not occur. To demonstrate the efficacy of our method we conducted extensive experiments and compared the performance with the VA-file-based index and the linear scan by using real image and video datasets, and obtained a remarkable speed-up over them.
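A rough sketch of per-dimension bitmap indexing in the spirit of the abstract above: each dimension keeps its own bin bitmaps, so a query touches only the dimensions it constrains. The binning scheme, the query interface and the class name are simplified assumptions rather than the paper's exact structure.

```python
# Minimal sketch: one set of bin bitmaps per dimension; a range predicate on
# a dimension ORs its bin bitmaps, and a conjunctive query ANDs the
# per-dimension results, so only the dimensions named in the query are touched.
import numpy as np

class BitmapIndex:
    def __init__(self, data, n_bins=8):
        self.edges = [np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
                      for col in data.T]
        self.bitmaps = []
        for d, col in enumerate(data.T):
            bins = np.digitize(col, self.edges[d])
            self.bitmaps.append([(bins == b) for b in range(n_bins)])

    def query(self, predicates):
        """predicates: {dim: (lo_bin, hi_bin)} inclusive bin ranges per dimension."""
        result = None
        for dim, (lo, hi) in predicates.items():
            dim_mask = np.any(self.bitmaps[dim][lo:hi + 1], axis=0)
            result = dim_mask if result is None else (result & dim_mask)
        return np.flatnonzero(result)

rng = np.random.default_rng(3)
X = rng.random((10_000, 32))                       # 10k objects, 32 feature dims
index = BitmapIndex(X)
print(index.query({0: (6, 7), 5: (0, 1)})[:10])    # high in dim 0, low in dim 5
```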
... Existing works [2][3][4][5][6] on complex queries mainly find the relevance of the individual concepts of the query with the corpus images in a similar way as when processing a simple query. Nie et al. [2] proposed an image re-ranking framework that exploits web sources such as Google search results in addition to visual and textual information for query-image relevance estimation. ...
Article
With the increase in popularity of image-based applications, users are retrieving images using more sophisticated and complex queries. We present three types of complex queries, namely, long, ambiguous, and abstract. Each type of query has its own characteristics/complexities and thus leads to imprecise and incomplete image retrieval. Existing methods for image retrieval are unable to deal with the high complexity of such queries. Search engines need to integrate their image retrieval process with knowledge to obtain rich semantics for effective retrieval. We propose a framework, Image Retrieval using Knowledge Embedding (ImReKE), for embedding knowledge with images and queries, allowing retrieval approaches to understand the context of queries and images in a better way. ImReKE (IR_Approach, Knowledge_Base) takes two inputs, namely, an image retrieval approach and a knowledge base. It selects quality concepts (concepts that possess properties such as rarity, newness, etc.) from the knowledge base to provide rich semantic representations for queries and images to be leveraged by the image retrieval approach. For the first time, an effective knowledge base that exploits both the visual and textual information of concepts has been developed. Our extensive experiments demonstrate that the proposed framework improves image retrieval significantly for all types of complex queries. The improvement is remarkable in the case of abstract queries, which have not yet been dealt with explicitly in the existing literature. We also compare the quality of our knowledge base with the existing text-based knowledge bases, such as ConceptNet, ImageNet, and the like.
... Cartoon object models are pre-defined and 2D clipart images are composed according to the text. Siddiquie et al. [18] devise a multi-modal framework for retrieving images from information including images, sketches and text by jointly considering objects, attributes and spatial relationships. Their goal is achieved by converting all information into the same representation, i.e., 2D sketches. ...
Article
Full-text available
Spatial relationships between objects provide important information for text-based image retrieval. As users are more likely to describe a scene from a real world perspective, using 3D spatial relationships rather than 2D relationships that assume a particular viewing direction, one of the main challenges is to infer the 3D structure that bridges images with users' text descriptions. However, direct inference of 3D structure from images requires learning from large scale annotated data. Since interactions between objects can be reduced to a limited set of atomic spatial relations in 3D, we study the possibility of inferring 3D structure from a text description rather than an image, applying physical relation models to synthesize holistic 3D abstract object layouts satisfying the spatial constraints present in a textual description. We present a generic framework for retrieving images from a textual description of a scene by matching images with these generated abstract object layouts. Images are ranked by matching object detection outputs (bounding boxes) to 2D layout candidates (also represented by bounding boxes) which are obtained by projecting the 3D scenes with sampled camera directions. We validate our approach using public indoor scene datasets and show that our method outperforms an object-occurrence-based baseline and a learned 2D pairwise-relation-based baseline.
... Convolutional Neural Networks (CNN) have also been used to learn high-level features and combine two modalities (text-image pairs) for cross-modal retrieval [12]. A Partial Least Squares (PLS) based approach [10] that maps different modalities into a common latent space has also been used in an image retrieval task. Contrary to these approaches that require training, our focus is on unsupervised multimodal fusion. ...
Conference Paper
Full-text available
Effective multimedia retrieval requires the combination of the heterogeneous media contained within multimedia objects and the features that can be extracted from them. To this end, we extend a unifying framework that integrates all well-known weighted, graph-based, and diffusion-based fusion techniques that combine two modalities (textual and visual similarities) to model the fusion of multiple modalities. We also provide a theoretical formula for the optimal number of documents that need to be initially selected, so that the memory cost in the case of multiple modalities remains the same as in the case of two modalities. Experiments using two test collections and three modalities (similarities based on visual descriptors, visual concepts, and textual concepts) indicate improvements in the effectiveness over bimodal fusion under the same memory complexity.
... A unifying model for unsupervised fusion of all similarities per modality has been presented in [10] and has been generalized to a non-linear fusion approach [2] that combines cross-media similarities with diffusion-based scores on the graph of items, in a non-linear but scalable way, for several modalities. Other methodologies for combining heterogeneous modalities involve Partial Least Squares [11], [12] and correlation matching, mapping multiple modalities to points in a common linear subspace. In [13] a video retrieval framework is proposed, which fuses textual and visual information in a non-linear way. ...
... Currently, the investigation of multiple views and modalities in computer vision tasks, such as visual tracking [12][13][14], object recognition [15,16,18,19], and visual retrieval [21,32], has attracted much research interest, especially for the cross-modal retrieval problem [10,36,42]. The various modal data existing on the Web, such as images, text and video, provide complementary information to describe the semantics of objects. ...
Article
Full-text available
Recent years have witnessed a surge of interests in cross-modal ranking. To bridge the gap between heterogeneous modalities, many projection based methods have been studied to learn common subspace where the correlation across different modalities can be directly measured. However, these methods generally consider pair-wise relationship merely, while ignoring the high-order relationship. In this paper, a combinative hypergraph learning in subspace for cross-modal ranking (CHLS) is proposed to enhance the performance of cross-modal ranking by capturing high-order relationship. We formulate the cross-modal ranking as a hypergraph learning problem in latent subspace where the high-order relationship among ranking instances can be captured. Furthermore, we propose a combinative hypergraph based on fused similarity information to encode both the intra-similarity in each modality and the inter-similarity across different modalities into the compact subspace representation, which can further enhance the performance of cross-modal ranking. Experiments on three representative cross-modal datasets show the effectiveness of the proposed method for cross-modal ranking. Furthermore, the ranking results achieved by the proposed CHLS can recall 80% of the relevant cross-modal instances at a much earlier stage compared against state-of-the-art methods for both cross-modal ranking tasks, i.e. image query text and text query image.
... Related Work: Hashing techniques [16,33,58,38,34,17,70,35,14,66,36,44,37,51,25,32] have recently been successfully applied to encode high-dimensional features into compact similarity-preserving binary codes, which enables extremely fast similarity search by the use of Hamming distances. Inspired by this, some recent SBIR works [1,15,40,52,54,56] have incorporated existing hashing methods for efficient retrieval. For instance, LSH [16] and ITQ [17] are adopted to sketch-based image [1] and 3D model [15] retrieval tasks, respectively. ...
Article
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view retrieval task, in which queries are abstract and ambiguous sketches while the retrieval database is formed with natural images. Work in this area mainly focuses on extracting representative and shared features for sketches and natural images. However, these can neither cope well with the geometric distortion between sketches and images nor be feasible for large-scale SBIR due to the heavy continuous-valued distance computation. In this paper, we speed up SBIR by introducing a novel binary coding method, named Deep Sketch Hashing (DSH), where a semi-heterogeneous deep architecture is proposed and incorporated into an end-to-end binary coding framework. Specifically, three convolutional neural networks are utilized to encode free-hand sketches, natural images and, especially, the auxiliary sketch-tokens which are adopted as bridges to mitigate the sketch-image geometric distortion. The learned DSH codes can effectively capture the cross-view similarities as well as the intrinsic semantic correlations between different categories. To the best of our knowledge, DSH is the first hashing work specifically designed for category-level SBIR with an end-to-end deep architecture. The proposed DSH is comprehensively evaluated on two large-scale datasets of TU-Berlin Extension and Sketchy, and the experiments consistently show DSH's superior SBIR accuracies over several state-of-the-art methods, while achieving significantly reduced retrieval time and memory footprint.
Multi-modal similarity search has attracted considerable attention to meet the need of information retrieval across different types of media. To enable efficient multi-modal similarity search in large-scale databases, researchers have recently started to study multi-modal hashing. Most of the existing methods are applied to search across multi-views among which explicit correspondence is provided. Given a multi-modal similarity search task, we observe that abundant multi-view data can be found on the Web which can serve as an auxiliary bridge. In this paper, we propose a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence. HTH provides more flexible and discriminative ability by embedding heterogeneous media into different Hamming spaces, compared to almost all existing methods that map heterogeneous data in a common Hamming space. We formulate a joint optimization model to learn hash functions embedding heterogeneous media into different Hamming spaces, and a translator aligning different Hamming spaces. The extensive experiments on two real-world datasets, one publicly available dataset of Flickr, and the other MIRFLICKR-Yahoo Answers dataset, highlight the effectiveness and efficiency of our algorithm.
Article
For a number of important problems, isolated semantic representations of individual syntactic words or visual objects do not suffice, but instead a compositional semantic representation is required, for example, a literal phrase or a set of spatially-concurrent objects. In this paper, we aim to harness the existing image-sentence databases to exploit the compositional nature of image-sentence data for multi-modal deep embedding. Specifically, we propose an approach called hierarchical-alike (bottom-up two layers) multi-modal grounded compositional semantics learning (hiMoCS). The proposed hiMoCS systemically captures the compositional semantic connotation of multimodal data in the setting of hierarchical-alike deep learning by modelling the inherent correlations between two modalities of collaboratively grounded semantics, such as the textual entity (with its describing attribute) and visual object, the phrase (e.g., subject-verb-object triplet) and spatially-concurrent objects. We argue that hiMoCS is more appropriate to reflect the multi-modal compositional semantics of the image and its narrative textual sentence which are strongly-coupled. We evaluate hiMoCS on several benchmark datasets and show that the utilization of the hierarchical-alike multi-modal grounded compositional semantics (textual entities and visual objects, textual phrases and spatially-concurrent objects) achieves a much better performance than only using the flat grounded compositional semantics.
Article
We propose an approach for retrieving a sequence of natural sentences for an image stream. Since general users often take a series of pictures of their experiences, much online visual information exists in the form of image streams, for which it is better to take the whole image stream into consideration when producing natural language descriptions. While almost all previous studies have dealt with the relation between a single image and a single natural sentence, our work extends both input and output dimension to a sequence of images and a sequence of sentences. For retrieving a coherent flow of multiple sentences for a photo stream, we propose a multimodal neural architecture called coherence recurrent convolutional network (CRCN), which consists of convolutional neural networks, bidirectional long short-term memory (LSTM) networks, and an entity-based local coherence model. Our approach directly learns from the vast user-generated resource of blog posts as text-image parallel training data. We collect more than 22K unique blog posts with 170K associated images for the travel topics of NYC, Disneyland, Australia, and Hawaii. We demonstrate that our approach outperforms other state-of-the-art image captioning methods for text sequence generation, using both quantitative measures and user studies via Amazon Mechanical Turk.
Conference Paper
Full-text available
Image retrieval approaches proposed so far for the complex problem of image search and retrieval in very large image datasets can be roughly divided into those that use text descriptions of images (text-based image retrieval) and those that compare visual image content (content-based image retrieval). Both approaches have their strengths and drawbacks, especially in the case of searching for images in a general, unconstrained domain. To take advantage of both approaches, we propose a multimodal framework that uses both keywords and visual properties of images. Keywords are used to determine the semantics of the query while the example image presents the visual impression (perceptual and structural information) that retrieved images should match. In the paper, an overview of the proposed multimodal image retrieval framework is presented. For computing the content-based similarity between images, different feature sets and metrics were tested. The procedure is described with Corel and Flickr images from the domain of outdoor scenes.
Chapter
Large-scale data management and retrieval in complex domains such as images, videos, or biometrical data remains one of the most important and challenging information processing tasks. Even after two decades of intensive research, many questions still remain to be answered before working tools become available for everyday use. In this work, we focus on the practical applicability of different multi-modal retrieval techniques. Multi-modal searching, which combines several complementary views on complex data objects, follows the human thinking process and represents a very promising retrieval paradigm. However, a rapid development of modality fusion techniques in several diverse directions and a lack of comparisons between individual approaches have resulted in a confusing situation when the applicability of individual solutions is unclear. Aiming at improving the research community’s comprehension of this topic, we analyze and systematically categorize existing multi-modal search techniques, identify their strengths, and describe selected representatives. In the second part of the paper, we focus on the specific problem of large-scale multi-modal image retrieval on the web. We analyze the requirements of such task, implement several applicable fusion methods, and experimentally evaluate their performance in terms of both efficiency and effectiveness. The extensive experiments provide a unique comparison of diverse approaches to modality fusion in equal settings on two large real-world datasets.
Chapter
Towards meeting both challenges of big multimedia data such as scalability and diversity, this chapter presents the state-of-the-art techniques in multimodal fusion of heterogeneous sources of data. It explores both weakly supervised and semi-supervised approaches that minimize the complexity of the designed systems as well as maximizing their scalability potential to numerous concepts. The chapter demonstrates the benefits of the presented methods in two wide domains: multimodal fusion in multimedia retrieval and in multimedia classification. The chapter presents a unifying graph-based model for fusing two modalities. It shows the benefits of the proposed probabilistic fusion compared with baseline configurations incorporating either only visual or textual information, known late fusion techniques, as well as state-of-the-art methods in image classification with noisy labeled training data. The chapter provides a parametric analysis of the proposed approach, demonstrating how each parameter can change the performance of social active learning for image classification (SALIC).
Conference Paper
Full-text available
The present work addresses the challenge of integrating low-level information with high-level knowledge (known as the semantic gap) that exists in content-based image retrieval by introducing an approach to describe images by means of spatial relations. The proposed approach is called Image Retrieval using Region Analysis (IRRA) and relies on decomposing images into pairs of objects. This method generates a representation composed of n triples, each one containing a noun, a preposition and another noun. This representation paves the way to enable image retrieval based on spatial relations. Results for an indoor/outdoor classifier show that neural networks alone are capable of achieving 88% precision and recall, but when combined with an ontology this result increases by 10 percentage points, reaching 98% precision and recall.
Article
An approach to image retrieval using spatial configurations is presented. The goal is to search the database for images that contain similar objects (image-patches) with a given configuration, size and position. The proposed approach consists of creating localized representations robust to segmentation variations, and a sub-graph matching method to compare the query with the database items. Localized object representations are created using a community detection method that groups visually similar segments. Extensive experimental results on three challenging datasets are provided to demonstrate the feasibility of the approach.
Conference Paper
Full-text available
Retrieving images to match with a hand-drawn sketch query is a highly desired feature, especially with the popularity of devices with touch screens. Although query-by-sketch has been extensively studied since 1990s, it is still very challenging to build a real-time sketch-based image search engine on a large-scale database due to the lack of effective and efficient matching/indexing solutions. The explosive growth of web images and the phenomenal success of search techniques have encouraged us to revisit this problem and target at solving the problem of web-scale sketch-based image retrieval. In this work, a novel index structure and the corresponding raw contour-based matching algorithm are proposed to calculate the similarity between a sketch query and natural images, and make sketch-based image retrieval scalable to millions of images. The proposed solution simultaneously considers storage cost, retrieval accuracy, and efficiency, based on which we have developed a real-time sketch-based image search engine by indexing more than 2 million images. Extensive experiments on various retrieval tasks (basic shape search, specific image search, and similar image search) show better accuracy and efficiency than state-of-the-art methods.
Conference Paper
Full-text available
We address the problem of image search on a very large scale, where three constraints have to be considered jointly: the accuracy of the search, its efficiency, and the memory usage of the representation. We first propose a simple yet efficient way of aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation. We then show how to jointly optimize the dimension reduction and the indexing algorithm, so that it best preserves the quality of vector comparison. The evaluation shows that our approach significantly outperforms the state of the art: the search accuracy is comparable to the bag-of-features approach for an image representation that fits in 20 bytes. Searching a 10 million image dataset takes about 50ms.
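As a hedged illustration of the kind of aggregation described above, the following is a simplified VLAD-style encoding that sums residuals to codebook centroids; the codebook size, normalization and function names are illustrative choices, not the paper's exact construction.

```python
# Minimal sketch: aggregate local descriptors by summing their residuals to
# the nearest codebook centroid, then normalize, yielding one fixed-length
# vector per image that can be further reduced and indexed.
import numpy as np

def aggregate_residuals(descriptors, centroids):
    """descriptors: (n, d) local features; centroids: (k, d) visual codebook."""
    # Assign each descriptor to its nearest centroid.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    k, d = centroids.shape
    agg = np.zeros((k, d))
    for c in range(k):
        members = descriptors[assign == c]
        if len(members):
            agg[c] = (members - centroids[c]).sum(axis=0)   # residual sum
    vec = agg.ravel()
    return vec / (np.linalg.norm(vec) + 1e-12)              # L2-normalize

rng = np.random.default_rng(4)
local_descs = rng.standard_normal((200, 128))   # e.g., 200 SIFT-like descriptors
codebook = rng.standard_normal((16, 128))       # small illustrative codebook
image_vector = aggregate_residuals(local_descs, codebook)
print(image_vector.shape)                       # (2048,)
```

In the paper, the resulting vector is further compressed by joint dimensionality reduction and indexing so that each image fits in a few tens of bytes.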
Conference Paper
Full-text available
Image auto-annotation is an important open problem in computer vision. For this task we propose TagProp, a discriminatively trained nearest neighbor model. Tags of test images are predicted using a weighted nearest-neighbor model to exploit labeled training images. Neighbor weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In this manner, we can optimally combine a collection of image similarity metrics that cover different aspects of image content, such as local shape descriptors, or global color histograms. We also introduce a word specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words. We investigate the performance of different variants of our model and compare to existing work. We present experimental results for three challenging data sets. On all three, TagProp makes a marked improvement as compared to the current state-of-the-art.
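A minimal sketch of rank-weighted nearest-neighbor tag prediction in the spirit of TagProp; the exponential rank weighting, neighborhood size and function names below are illustrative assumptions (TagProp learns its weights and adds a per-word sigmoidal modulation, which is omitted here).

```python
# Minimal sketch: predict tag probabilities for a test image as a weighted
# vote over its nearest training images, with weights decaying by rank.
import numpy as np

def predict_tags(test_feat, train_feats, train_tags, k=10, decay=0.7):
    """train_tags: (n_train, n_tags) binary matrix of ground-truth tags."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    neighbors = np.argsort(dists)[:k]
    weights = decay ** np.arange(k)          # rank-based weights
    weights /= weights.sum()
    return weights @ train_tags[neighbors]   # per-tag relevance scores

rng = np.random.default_rng(5)
train_feats = rng.standard_normal((500, 64))
train_tags = (rng.random((500, 20)) < 0.1).astype(float)   # 20-word vocabulary
scores = predict_tags(rng.standard_normal(64), train_feats, train_tags)
print(np.argsort(-scores)[:5])               # top-5 predicted tag indices
```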
Conference Paper
Full-text available
We have created the first image search engine based entirely on faces. Using simple text queries such as "smiling men with blond hair and mustaches," users can search through over 3.1 million faces which have been automatically labeled on the basis of several facial attributes. Faces in our database have been extracted and aligned from images downloaded from the internet using a commercial face detector, and the number of images and attributes continues to grow daily. Our classification approach uses a novel combination of Support Vector Machines and Adaboost which exploits the strong structure of faces to select and train on the optimal set of features for each attribute. We show state-of-the-art classification results compared to previous works, and demonstrate the power of our architecture through a functional, large-scale face search engine. Our framework is fully automatic, easy to scale, and computes all labels off-line, leading to fast on-line search performance. In addition, we describe how our system can be used for a number of applications, including law enforcement, social networks, and personal photo management. Our search engine will soon be made publicly available.
Conference Paper
Full-text available
Learning visual classifiers for object recognition from weakly labeled data requires determining correspondence between image regions and semantic object classes. Most approaches use co-occurrence of “nouns” and image features over large datasets to determine the correspondence, but many correspondence ambiguities remain. We further constrain the correspondence problem by exploiting additional language constructs to improve the learning process from weakly labeled data. We consider both “prepositions” and “comparative adjectives” which are used to express relationships between objects. If the models of such relationships can be determined, they help resolve correspondence ambiguities. However, learning models of these relationships requires solving the correspondence problem. We simultaneously learn the visual features defining “nouns” and the differential visual features defining such “binary-relationships” using an EM-based approach.
Conference Paper
Full-text available
Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
Article
Full-text available
We present a system that composes a realistic picture from a simple freehand sketch annotated with text labels. The composed picture is generated by seamlessly stitching several photographs in agreement with the sketch and text labels; these are found by searching the Internet. Although online image search generates many inappropriate results, our system is able to automatically select suitable photographs to generate a high quality composition, using a filtering scheme to exclude undesirable images. We also provide a novel image blending algorithm to allow seamless image composition. Each blending result is given a numeric score, allowing us to find an optimal combination of discovered images. Experimental results show the method is very successful; we also evaluate our system using the results from two user studies.
Article
Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meanings. For high level visual tasks, such low-level image representations are potentially not enough. In this paper, we propose a high-level image representation, called the Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging on the Object Bank representation, superior performances on high level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM. Sparsity algorithms make our representation more efficient and scalable for large scene datasets, and reveal semantically meaningful feature patterns.
Conference Paper
We consider image retrieval with structured object queries --- queries that specify the objects that should be present in the scene, and their spatial relations. An example of such queries is "car on the road". Existing image retrieval systems typically consider queries consisting of object classes (i.e. keywords). They train a separate classifier for each object class and combine the output heuristically. In contrast, we develop a learning framework to jointly consider object classes and their relations. Our method considers not only the objects in the query ("car" and "road" in the above example), but also related object categories that can be useful for retrieval. Since we do not have ground-truth labeling of object bounding boxes on the test image, we represent them as latent variables in our model. Our learning method is an extension of the ranking SVM with latent variables, which we call latent ranking SVM. We demonstrate image retrieval and ranking results on a dataset with more than a hundred object classes.
Article
Current approaches to object category recognition require datasets of training images to be manually prepared, with varying degrees of supervision. We present an approach that can learn an object category from just its name, by utilizing the raw output of image search engines available on the Internet. We develop a new model, TSI-pLSA, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner. Our approach can handle the high intra-class variability and large proportion of unrelated images returned by search engines. We evaluate the models on standard test sets, showing performance competitive with existing methods trained on hand prepared datasets.
State of the art methods for image and object retrieval exploit both appearance (via visual words) and local geometry (spatial extent, relative pose). In large scale problems, memory becomes a limiting factor - local geometry is stored for each feature detected in each image and requires storage larger than the inverted file and term frequency and inverted document frequency weights together. We propose a novel method for learning discretized local geometry representation based on minimization of average reprojection error in the space of ellipses. The representation requires only 24 bits per feature without drop in performance. Additionally, we show that if the gravity vector assumption is used consistently from the feature description to spatial verification, it improves retrieval performance and decreases the memory footprint. The proposed method outperforms state of the art retrieval algorithms in a standard image retrieval benchmark.
The most popular approach to large scale image retrieval is based on the bag-of-visual-word (BoV) representation of images. The spatial information is usually re-introduced as a post-processing step to re-rank the retrieved images, through a spatial verification like RANSAC. Since the spatial verification techniques are computationally expensive, they can be applied only to the top images in the initial ranking. In this paper, we propose an approach that can encode more spatial information into the BoV representation and that is efficient enough to be applied to large-scale databases. Other works pursuing the same purpose have proposed exploring the word co-occurrences in the neighborhood areas. Our approach encodes more spatial information through the geometry-preserving visual phrases (GVP). In addition to co-occurrences, the GVP method also captures the local and long-range spatial layouts of the words. Our GVP based searching algorithm adds little memory usage or computational time compared to the BoV method. Moreover, we show that our approach can also be integrated into the min-hash method to improve its retrieval accuracy. The experiment results on the Oxford 5K and Flickr 1M datasets show that our approach outperforms the BoV method even following a RANSAC verification.
We posit that visually descriptive language offers computer vision researchers both information about the world, and information about how people describe the world. The potential benefit from this source is made more significant due to the enormous amount of language data easily available today. We present a system to automatically generate natural language descriptions from images that exploits both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision. The system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.
This paper presents a novel way to perform multi-modal face recognition. We use Partial Least Squares (PLS) to linearly map images in different modalities to a common linear subspace in which they are highly correlated. PLS has been previously used effectively for feature selection in face recognition. We show both theoretically and experimentally that PLS can be used effectively across modalities. We also formulate a generic intermediate subspace comparison framework for multi-modal recognition. Surprisingly, we achieve high performance using only pixel intensities as features. We experimentally demonstrate the highest published recognition rates on the pose variations in the PIE data set, and also show that PLS can be used to compare sketches to photos, and to compare images taken at different resolutions.
We propose a novel approach for ranking and retrieval of images based on multi-attribute queries. Existing image retrieval methods train separate classifiers for each word and heuristically combine their outputs for retrieving multiword queries. Moreover, these approaches also ignore the interdependencies among the query terms. In contrast, we propose a principled approach for multi-attribute retrieval which explicitly models the correlations that are present between the attributes. Given a multi-attribute query, we also utilize other attributes in the vocabulary which are not present in the query, for ranking/retrieval. Furthermore, we integrate ranking and retrieval within the same formulation, by posing them as structured prediction problems. Extensive experimental evaluation on the Labeled Faces in the Wild (LFW), FaceTracer and PASCAL VOC datasets shows that our approach significantly outperforms several state-of-the-art ranking and retrieval methods.
Conference Paper
Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be efficiently searched. However, existing methods do not apply for high-dimensional kernelized data when the underlying feature embedding for the kernel is unknown. We show how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sub-linear time similarity search guarantees for a wide class of useful similarity functions. Since a number of successful image-based kernels have unknown or incomputable embeddings, this is especially valuable for image retrieval tasks. We validate our technique on several large-scale datasets, and show that it enables accurate and fast performance for example-based object classification, feature matching, and content-based retrieval.
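For orientation, the sketch below shows plain random-hyperplane LSH for cosine similarity, which is the starting point that the paper generalizes to arbitrary kernels; the kernelized construction itself is not reproduced here, and the table/bit counts and class name are illustrative assumptions.

```python
# Minimal sketch: random-hyperplane LSH for cosine similarity. The paper's
# contribution is the kernelized variant (hash functions built without an
# explicit feature embedding); this plain version only illustrates the idea
# of bucketing similar vectors under the same short binary keys.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(6)

class HyperplaneLSH:
    def __init__(self, dim, n_bits=16, n_tables=4):
        self.planes = [rng.standard_normal((dim, n_bits)) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, x):
        return [tuple((x @ P > 0).astype(int)) for P in self.planes]

    def add(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, x):
        """Union of candidates that collide with x in any table."""
        return set().union(*(set(t.get(k, [])) for t, k in zip(self.tables, self._keys(x))))

data = rng.standard_normal((1000, 128))
lsh = HyperplaneLSH(dim=128)
for i, v in enumerate(data):
    lsh.add(i, v)
print(len(lsh.query(data[0])))   # candidate set examined instead of all 1000
```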
The Internet contains billions of images, freely available online. Methods for efficiently searching this incredibly rich resource are vital for a large number of applications. These include object recognition (2), computer graphics (11, 27), personal photo collections, online image search tools. In this paper, our goal is to develop efficient image search and scene matching techniques that are not only fast, but also require very little memory, enabling their use on standard hardware or even on handheld devices. Our approach uses recently developed machine learning tech- niques to convert the Gist descriptor (a real valued vector that describes orientation energies at different scales and orientations within an image) to a compact binary code, with a few hundred bits per image. Using our scheme, it is possible to perform real-time searches with millions from the Internet using a single large PC and obtain recognition results comparable to the full descriptor. Using our codes on high quality labeled images from the LabelMe database gives surprisingly powerful recognition results using simple nearest neighbor techniques.
We propose semantic texton forests, efficient and powerful new low-level features. These are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors. They are extremely fast to both train and test, especially compared with k-means clustering and nearest-neighbor assignment of feature descriptors. The nodes in the trees provide (i) an implicit hierarchical clustering into semantic textons, and (ii) an explicit local classification estimate. Our second contribution, the bag of semantic textons, combines a histogram of semantic textons over an image region with a region prior category distribution. The bag of semantic textons is computed over the whole image for categorization, and over local rectangular regions for segmentation. Including both histogram and region prior allows our segmentation algorithm to exploit both textural and semantic context. Our third contribution is an image-level prior for segmentation that emphasizes those categories that the automatic categorization believes to be present. We evaluate on two datasets including the very challenging VOC 2007 segmentation dataset. Our results significantly advance the state-of-the-art in segmentation accuracy, and furthermore, our use of efficient decision forests gives at least a five-fold increase in execution speed.
This paper addresses the problem of learning similarity-preserving binary codes for efficient retrieval in large-scale image collections. We propose a simple and efficient alternating minimization scheme for finding a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube. This method, dubbed iterative quantization (ITQ), has connections to multi-class spectral clustering and to the orthogonal Procrustes problem, and it can be used both with unsupervised data embeddings such as PCA and supervised embeddings such as canonical correlation analysis (CCA). Our experiments show that the resulting binary coding schemes decisively outperform several other state-of-the-art methods.
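A compact NumPy rendering of the alternating scheme described above (PCA projection, binarization, and an orthogonal-Procrustes rotation update); the bit length, iteration count and function name are illustrative assumptions, though the update steps follow the ITQ formulation.

```python
# Minimal sketch of iterative quantization (ITQ): project zero-centered data
# with PCA, then alternate between (a) binarizing the rotated data and
# (b) solving an orthogonal Procrustes problem for the rotation.
import numpy as np

def itq_codes(X, n_bits=32, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                       # zero-center
    # PCA projection to n_bits dimensions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Xc @ Vt[:n_bits].T
    # Random orthogonal initialization of the rotation.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iters):
        B = np.sign(V @ R)                        # fix R, update binary codes
        B[B == 0] = 1
        U, _, Wt = np.linalg.svd(V.T @ B)         # fix B, update rotation (Procrustes)
        R = U @ Wt
    return ((V @ R) > 0).astype(np.uint8), R

rng = np.random.default_rng(7)
X = rng.standard_normal((2000, 128))
codes, rotation = itq_codes(X)
print(codes.shape, rotation.shape)                # (2000, 32) (32, 32)
```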
There has been a growing interest in exploiting contextual information in addition to local features to detect and localize multiple object categories in an image. Context models can efficiently rule out some unlikely combinations or locations of objects and guide detectors to produce a semantically coherent interpretation of a scene. However, the performance benefit from using context models has been limited because most of these methods were tested on datasets with only a few object categories, in which most images contain only one or two object categories. In this paper, we introduce a new dataset with images that contain many instances of different object categories and propose an efficient model that captures the contextual information among more than a hundred object categories. We show that our context model can be applied to scene understanding tasks that local detectors alone cannot solve.
Conference Paper
In this paper, we present a novel image search system, image search by concept map. This system enables users to indicate not only what semantic concepts are expected to appear but also how these concepts are spatially distributed in the desired images. To this end, we propose a new image search interface to enable users to formulate a query, called concept map, by intuitively typing textual queries in a blank canvas to indicate the desired spatial positions of the concepts. In the ranking process, by interpreting each textual concept as a set of representative visual instances, the concept map query is translated into a visual instance map, which is then used to evaluate the relevance of the image in the database. Experimental results demonstrate the effectiveness of the proposed system.
Conference Paper
We describe a method for generating N-best configurations from part-based models, ensuring that they do not overlap according to some user-provided definition of overlap. We extend previous N-best algorithms from the speech community to incorporate non-maximal suppression cues, such that pixel-shifted copies of a single configuration are not returned. We use approximate algorithms that perform nearly identical to their exact counterparts, but are orders of magnitude faster. Our approach outperforms standard methods for generating multiple object configurations in an image. We use our method to generate multiple pose hypotheses for the problem of human pose estimation from video sequences. We present quantitative results that demonstrate that our framework significantly improves the accuracy of a state-of-the-art pose estimation algorithm.
Conference Paper
Many applications in Multilingual and Multimodal Information Access involve searching large databases of high dimensional data objects with multiple (conditionally independent) views. In this work we consider the problem of learning hash functions for similarity search across the views for such applications. We propose a principled method for learning a hash function for each view given a set of multiview training data objects. The hash functions map similar objects to similar codes across the views thus enabling cross-view similarity search. We present results from an extensive empirical study of the proposed approach which demonstrate its effectiveness on Japanese language People Search and Multilingual People Search problems. [12/07/2011: Fixed the typographical errors in the IJCAI version of the paper that confused some readers. Thank you Abhishek Sharma and Bin Li for bringing the errors to my notice.]
The problem of large-scale image search has been traditionally addressed with the bag-of-visual-words (BOV). In this article, we propose to use as an alternative the Fisher kernel framework. We first show why the Fisher representation is well-suited to the retrieval problem: it describes an image by what makes it different from other images. One drawback of the Fisher vector is that it is high-dimensional and, as opposed to the BOV, it is dense. The resulting memory and computational costs do not make Fisher vectors directly amenable to large-scale retrieval. Therefore, we compress Fisher vectors to reduce their memory footprint and speed-up the retrieval. We compare three binarization approaches: a simple approach devised for this representation and two standard compression techniques. We show on two publicly available datasets that compressed Fisher vectors perform very well using as little as a few hundreds of bits per image, and significantly better than a very recent compressed BOV approach.
Conference Paper
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the l_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p less than or equal to 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than a kd-tree.
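A minimal sketch of the p-stable hash family for the Euclidean case described above, h(x) = floor((a·x + b)/w) with Gaussian a and uniform offset b; the bucket width and the function names are illustrative assumptions.

```python
# Minimal sketch: an l2 LSH hash function h(x) = floor((a.x + b) / w), where
# a has i.i.d. Gaussian entries (2-stable) and b is uniform in [0, w).
# Nearby points are more likely to fall into the same bucket.
import numpy as np

rng = np.random.default_rng(8)

def make_hash(dim, w=4.0):
    a = rng.standard_normal(dim)       # 2-stable (Gaussian) projection
    b = rng.uniform(0.0, w)            # random offset
    return lambda x: int(np.floor((a @ x + b) / w))

dim = 64
h = make_hash(dim)
x = rng.standard_normal(dim)
near = x + 0.01 * rng.standard_normal(dim)
far = rng.standard_normal(dim) * 10
print(h(x) == h(near), h(x) == h(far))   # typically: True False
```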
We demonstrate a method for identifying images containing categories of animals. The images we classify depict animals in a wide range of aspects, configurations and appearances. In addition, the images typically portray multiple species that differ in appearance (e.g. ukari’s, vervet monkeys, spider monkeys, rhesus monkeys, etc.). Our method is accurate despite this variation and relies on four simple cues: text, color, shape and texture. Visual cues are evaluated by a voting method that compares local image phenomena with a number of visual exemplars for the category. The visual exemplars are obtained using a clustering method applied to text on web pages. The only supervision required involves identifying which clusters of exemplars refer to which sense of a term (for example, "monkey" can refer to an animal or a bandmember). Because our method is applied to web pages with free text, the word cue is extremely noisy. We show unequivocal evidence that visual information improves performance for our task. Our method allows us to produce large, accurate and challenging visual datasets mostly automatically.
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
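A small sketch of the spatial pyramid construction described above: quantized local features are histogrammed over increasingly fine grids and the per-level histograms are concatenated with pyramid-match weights. The vocabulary size, number of levels and function name are illustrative assumptions.

```python
# Minimal sketch: a spatial pyramid histogram. Quantized local features
# (visual-word ids with normalized (x, y) positions) are histogrammed over
# 1x1, 2x2 and 4x4 grids and concatenated with the usual level weights.
import numpy as np

def spatial_pyramid(word_ids, positions, vocab_size, levels=3):
    """positions are (x, y) pairs in [0, 1); returns the concatenated weighted histogram."""
    L = levels - 1
    feats = []
    for level in range(levels):
        cells = 2 ** level
        # Pyramid-match weights: coarsest level gets 1/2^L, level l>0 gets 1/2^(L-l+1).
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        cell_idx = np.floor(positions * cells).astype(int).clip(0, cells - 1)
        for cy in range(cells):
            for cx in range(cells):
                in_cell = (cell_idx[:, 0] == cx) & (cell_idx[:, 1] == cy)
                hist = np.bincount(word_ids[in_cell], minlength=vocab_size)
                feats.append(weight * hist)
    vec = np.concatenate(feats).astype(float)
    return vec / (vec.sum() + 1e-12)

rng = np.random.default_rng(9)
ids = rng.integers(0, 200, size=500)          # 500 quantized descriptors
pos = rng.random((500, 2))                    # their normalized positions
print(spatial_pyramid(ids, pos, vocab_size=200).shape)   # (4200,) = 200 * (1 + 4 + 16)
```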
Article
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.
Article
We present a method for searching in an image database using a query image that is similar to the intended target. The query image may be a hand-drawn sketch or a (potentially low-quality) scan of the image to be retrieved. Our searching algorithm makes use of multiresolution wavelet decompositions of the query and database images. The coefficients of these decompositions are distilled into small "signatures" for each image. We introduce an "image querying metric" that operates on these signatures. This metric essentially compares how many significant wavelet coefficients the query has in common with potential targets. The metric includes parameters that can be tuned, using a statistical analysis, to accommodate the kinds of image distortions found in different types of image queries. The resulting algorithm is simple, requires very little storage overhead for the database of signatures, and is fast enough to be performed on a database of 20,000 images at interactive rates (on standard...
Article
We describe a highly functional prototype system for searching by visual features in an image database. The VisualSEEk system is novel in that the user forms the queries by diagramming spatial arrangements of color regions. The system finds the images that contain the most similar arrangements of similar regions. Prior to the queries, the system automatically extracts and indexes salient color regions from the images. By utilizing efficient indexing techniques for color information, region sizes and absolute and relative spatial locations, a wide variety of complex joint color/spatial queries may be computed. KEYWORDS: image databases, content-based retrieval, image indexing, similarity retrieval, spatial query. INTRODUCTION: In this paper we investigate a new content-based image query system that enables querying by image regions and spatial layout. VisualSEEk is a hybrid system in that it integrates feature-based image indexing with spatial query methods. The integration reli...
Locality sensitive binary codes from shift-invariant kernels
  • M Raginsky
  • S Lazebnik
Data-driven visual similarity for cross-domain image matching
  • Abhinav Shrivastava
  • Tomasz Malisiewicz
  • Abhinav Gupta
  • Alexei A Efros