Article

Abstract

In this article, we present a novel framework that can produce a visual description of a tourist attraction by choosing the most diverse pictures from community-contributed datasets, which describe different details of the queried location. The main strength of the proposed approach is its flexibility, which permits us to filter out non-relevant images and to obtain a reliable set of diverse and relevant images by first clustering similar images according to their textual descriptions and their visual content, and then extracting images from different clusters according to a measure of the user's credibility. Clustering is based on a two-step process, where textual descriptions are used first and the clusters are then refined according to the visual features. The degree of diversification can be further increased by exploiting users' judgments on the results produced by the proposed algorithm through a novel approach, where users provide not only relevance feedback but also diversity feedback. Experimental results on the MediaEval 2015 "Retrieving Diverse Social Images" dataset show that the proposed framework achieves very good performance both in the automatic retrieval of diverse images and in the exploitation of users' feedback. The effectiveness of the proposed approach has also been confirmed by a small case study involving a number of real users.
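A minimal sketch of the two-step clustering described in the abstract, with illustrative assumptions not stated there: `captions` holds one textual description per image, `visual` is a NumPy array with one feature vector per image, and the cluster counts are arbitrary choices. Text clusters are formed first and then refined by visual content.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

def two_step_clustering(captions, visual, n_text_clusters=10, max_visual_splits=3):
    """Cluster by text first, then refine each text cluster by visual content."""
    # Step 1: group images whose textual descriptions are similar.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(captions)
    text_labels = AgglomerativeClustering(
        n_clusters=n_text_clusters).fit_predict(tfidf.toarray())

    # Step 2: split every text cluster according to the visual features.
    refined, next_id = np.zeros(len(captions), dtype=int), 0
    for c in np.unique(text_labels):
        idx = np.where(text_labels == c)[0]
        k = min(max_visual_splits, len(idx))
        if k < 2:
            refined[idx] = next_id
        else:
            refined[idx] = next_id + AgglomerativeClustering(
                n_clusters=k).fit_predict(visual[idx])
        next_id += k
    return refined
```

The diverse result list can then be assembled by taking one image per refined cluster, choosing within each cluster, say, the image whose uploader has the highest credibility estimate.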

... In particular, they avoid relying either on costly professional textual annotations, which are needed to deem labels and tags reliable, or on social labels and tags, for which an estimate of their relevance needs to be computed. Moreover, CBIR systems offer advantages even when textual annotations are already present, since such annotations may focus on only some aspects of the image and neglect other important content [11]. Most of the existing successful CBIR systems are tailored to specific applications, usually referred to as narrow-domain systems, and consequently to specific retrieval problems, such as Computer-Aided Diagnosis systems [27], sport events [12], and cultural heritage preservation [7]. ...
... While most current CBIR systems employ low-level and mid-level features [12,30] that capture information such as colour, edges, and texture [6], CBIR systems addressing specific retrieval problems leverage different kinds of specifically designed features [40]. Some works exploited low-level image descriptors such as SIFT [44], originally proposed for object recognition [25], different colour histograms [23], or a fusion of textual and visual information [11] for scene categorization. More specific low-level features designed for other applications have also been used in CBIR systems, such as the HOG [10] and LBP [28] descriptors, originally proposed for pedestrian detection and texture analysis, respectively. ...
Chapter
Full-text available
After more than 20 years of research on Content-Based Image Retrieval (CBIR), the community is still facing many challenges in improving retrieval results by filling the semantic gap between the user's needs and the automatic image description provided by different image representations. Including the human in the loop through Relevance Feedback (RF) mechanisms has turned out to help improve retrieval results in CBIR. In this paper, we claim that nearest-neighbour approaches still provide an effective method to assign a Relevance Score to images after the user labels a small set of images as being relevant or not to a given query. Although many other approaches to relevance feedback have been proposed in the past ten years, we show that the Relevance Score, while simple in its implementation, attains superior results with respect to more complex approaches and can be easily adopted with any feature representation. Reported results on different real-world datasets with a large number of classes, characterised by different degrees of semantic and visual intra- and inter-class variability, clearly show the current challenges faced by CBIR systems in reaching acceptable retrieval performance, and the effectiveness of nearest-neighbour approaches in exploiting Relevance Feedback.
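A minimal sketch of a nearest-neighbour Relevance Score of the kind the chapter describes (the exact formulation there may differ): `rel` and `nonrel` are arrays of feature vectors for the images the user labelled as relevant and non-relevant, and Euclidean distance is an assumption.

```python
import numpy as np

def relevance_score(x, rel, nonrel):
    """Relevance Score of image x given labelled feature sets rel / nonrel."""
    d_rel = np.min(np.linalg.norm(rel - x, axis=1))     # nearest relevant image
    d_non = np.min(np.linalg.norm(nonrel - x, axis=1))  # nearest non-relevant one
    # In [0, 1]; close to 1 when x sits much nearer to the relevant set.
    return d_non / (d_rel + d_non)

# Usage: after one feedback round, rank the database by descending score, e.g.
# scores = [relevance_score(x, rel, nonrel) for x in database]
```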
... User 2 and user 3 classify images at different category granularities, beyond the dynamic classes that arise when users classify images themselves. Furthermore, user feedback must be taken into consideration in user-dependent applications, such as relevance feedback in image retrieval [10], interactive image collection navigation [4], and image annotation [19]. In these systems, users can give feedback and get refined results immediately. ...
... The idea of "including the user feedback in the loop" is widely used to bridge the semantic gap in image retrieval [36]. Relevance feedback was proposed for this task to learn from a specific user by asking them to provide feedback regarding the (ir)relevance of the current retrieval results [10]. The system then learns from this (ir)relevance feedback to achieve improved performance in the next round, iteratively if necessary. ...
Article
Full-text available
With the explosive growth of personal photos, an effective classification tool is becoming an urgent need for organizing our growing image collections. To handle dynamically growing collections, we present a new method to categorize images effectively by integrating image clustering, incremental updating, and user feedback in an online framework. Considering the user's burden and user-specific preferences during image classification, we propose several strategies to progressively learn a customized classification model for each user. First, we use a multi-view learning method to learn the user's preferred classification perspective. Second, we cluster similar images into groups according to the user's preference, so that images in a group can be categorized simultaneously with high efficiency. Third, we propose a multi-centroid nearest class mean classifier to learn the user's preferred category granularity online, and use it to classify the image groups. Unlike offline systems, where pre-labeling and batch training often take hours or even days, our approach is fully online: it can learn the classification model and classify newly acquired images alternately without delay. Extensive experimental results and a user study demonstrate the effectiveness of the proposed method.
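A minimal sketch of a multi-centroid nearest class mean classifier with online updates, in the spirit of the abstract; the spawning threshold `tau` and the update rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

class MultiCentroidNCM:
    """Nearest class mean classifier with several online-updated centroids per class."""

    def __init__(self, tau=1.0):
        self.centroids = {}   # class label -> list of [mean vector, sample count]
        self.tau = tau        # distance beyond which a new centroid is spawned

    def predict(self, x):
        # Assign x to the class owning the globally nearest centroid.
        pairs = ((label, np.linalg.norm(x - m))
                 for label, cents in self.centroids.items() for m, _ in cents)
        return min(pairs, key=lambda p: p[1])[0]

    def partial_fit(self, x, label):
        x = np.asarray(x, dtype=float)
        cents = self.centroids.setdefault(label, [])
        if cents:
            d = [np.linalg.norm(x - m) for m, _ in cents]
            i = int(np.argmin(d))
            if d[i] <= self.tau:          # close enough: shift the running mean
                m, n = cents[i]
                cents[i] = [(m * n + x) / (n + 1), n + 1]
                return
        cents.append([x, 1])              # too far (or a new class): new centroid
```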
... Multimodal learning involves integrating and processing data from multiple sensory channels or modalities, such as text, images, and audio, to build more robust and comprehensive learning models. It has widespread applications in multimedia (Naphade et al., 2006;Atrey et al., 2010;Dang-Nguyen et al., 2017), robotics (Kirchner et al., 2019;Lee et al., 2019), large language model (Huang et al., 2023;Gao et al., 2023;Driess et al., 2023) and healthcare (Muhammad et al., 2021;Vanguri et al., 2022;Lipkova et al., 2022). ...
Preprint
Multimodal learning has become a pivotal approach in developing robust learning models with applications spanning multimedia, robotics, large language models, and healthcare. The efficiency of multimodal systems is a critical concern, given the varying costs and resource demands of different modalities. This underscores the necessity for effective modality selection to balance performance gains against resource expenditures. In this study, we propose a novel framework for modality selection that independently learns the representation of each modality. This approach allows for the assessment of each modality's significance within its unique representation space, enabling the development of tailored encoders and facilitating the joint analysis of modalities with distinct characteristics. Our framework aims to enhance the efficiency and effectiveness of multimodal learning by optimizing modality integration and selection.
... A case in point is that community question answering websites have turned into valuable knowledge repositories for specific domains [28]. However, the massive growth of multimedia data in social media makes it increasingly complex for users to utilize the data [5,19]. Fortunately, human-computer conversation systems in social media enable humans to obtain the information they need conveniently, which is why conversation systems have become one of the research hotspots in multimedia applications. ...
Article
Full-text available
The human–computer conversation system is a significant application in the field of multimedia. To select an appropriate response, retrieval-based systems model the matching between the dialogue history and response candidates. However, most of the existing methods cannot fully capture and utilize varied matching patterns, which may degrade the performance of the systems. To address the issue, a densely enhanced semantic network (DESN) is proposed in our work. Given a multi-turn dialogue history and a response candidate, DESN first constructs the semantic representations of sentences from the word perspective, the sentence perspective, and the dialogue perspective. In particular, the dialogue perspective is a novel one introduced in our work. The dependencies between a single sentence and the whole dialogue are modeled from the dialogue perspective. Then, the response candidate and each utterance in the dialogue history are made to interact with each other. The varied matching patterns are captured for each utterance–response pair by using a dense matching module. The matching patterns of all the utterance–response pairs are accumulated in chronological order to calculate the matching degree between the dialogue history and the response. The responses in the candidate pool are ranked with the matching degree, thereby returning the most appropriate candidate. Our model is evaluated on the benchmark datasets. The experimental results prove that our model achieves significant and consistent improvement when compared with other baselines.
... The recent advances and trends of MMDL range from audio-visual speech recognition (AVSR) [124,161], multimedia content indexing and retrieval [8,13,34,127], understanding human multimodal behaviors during social interaction, multimodal emotion recognition [16,25,61,171], image and video captioning [36,78], Visual Question-Answering (VQA) [91], and multimedia retrieval [128] to health analysis [155]. In this article, we analyzed the latest MMDL models. ...
Article
Full-text available
Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, main issues are highlighted separately for each domain, along with their possible future research directions.
... Different approaches have been proposed to reformulate the query according to the feedback received: finding an optimized query feature vector using Rocchio's algorithm [11,46,67,114], modifying the similarity measure so that relevant images have a high similarity value [1,124], exploiting images' geometrical and discriminant structures to learn a semantic subspace [39,120], or separating relevant and non-relevant images using Bayesian Networks [87], CNN [56,76,78,104], Clustering [27], Logistic Regression [30], Optimum Forest algorithm [54], and Support Vector Machine [82,97,112]. ...
Article
Full-text available
One of the main challenges in CBIR systems is to choose discriminative and compact features, among dozens, to represent the images under comparison. Over the years, a great effort has been made to combine multiple features, mainly using early, late, and hierarchical fusion techniques. The perfect combination of features is highly domain-specific and dependent on the type of image. Thus, designing a CBIR system for a new dataset or domain involves a huge experimentation overhead, leading to multiple fine-tuned CBIR systems. It would be desirable to dynamically find the best combination of CBIR systems without going through such extensive experimentation and without requiring prior domain knowledge. In this paper, we propose ExpertosLF, a model-agnostic interpretable late-fusion technique based on online learning with expert advice, which dynamically combines CBIR systems without knowing a priori which ones are best for a given domain. At each query, ExpertosLF takes advantage of the user's feedback to determine each CBIR system's contribution to the ensemble for the following queries. ExpertosLF produces an interpretable ensemble that is independent of the dataset and domain. Moreover, ExpertosLF is designed to be modular and scalable. Experiments on 13 benchmark datasets from the biomedical, real, and sketch domains revealed that: (i) ExpertosLF surpasses the performance of state-of-the-art late-fusion techniques; (ii) it successfully and quickly converges to the performance of the best CBIR sets across domains without any prior domain knowledge (in most cases, fewer than 25 queries need to receive human feedback).
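A minimal sketch of late fusion by prediction with expert advice, the general technique ExpertosLF is said to build on; the Hedge-style multiplicative update and the loss definition from user feedback are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

class ExpertFusion:
    """Hedge-style weighting of several CBIR systems ("experts")."""

    def __init__(self, n_experts, eta=0.5):
        self.w = np.full(n_experts, 1.0 / n_experts)   # uniform prior weights
        self.eta = eta                                 # learning rate

    def fuse(self, expert_scores):
        # expert_scores: (n_experts, n_images) relevance scores per system.
        return self.w @ expert_scores                  # weighted late fusion

    def update(self, losses):
        # losses in [0, 1], e.g. 1 - precision of each expert on the images
        # the user judged for the last query; bad experts lose weight fast.
        self.w *= np.exp(-self.eta * np.asarray(losses))
        self.w /= self.w.sum()
```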
... Based on the similarity graph, a combination of three clustering algorithms (namely METIS, spectral clustering, and hierarchical clustering) is applied for reranking. - Pra-mm [13,14]: The BIRCH clustering algorithm is applied to the provided visual features to group visually similar images into the same cluster. Clusters are then sorted based on their cardinalities. ...
Article
Full-text available
Image search reranking is emerging as an effective technique to refine text-based image search results using visual information. In this paper, we introduce a novel hypergraph-based image search reranking method that accounts for both relevance and diversity of search results. Namely, the text-based image search results are taken as vertices in a probabilistic regression-based hypergraph model, and reranking is formulated as a hypergraph ranking problem with absorbing nodes. More specifically, to discover related samples and characterize the relationships among them, we bring the Elastic Net regularized regression model into the hypergraph construction. Unlike conventional hypergraph construction schemes, our scheme is able to describe the high-order relationships and the local manifold structure among visual samples while ensuring datum-adaptiveness. Afterward, we apply hypergraph-based ranking with absorbing nodes to ensure a diversified reranking. That is, during the reranking process, previously ranked samples are transformed into absorbing nodes at each iteration, so that redundant ones are prevented from receiving high ranking scores. Extensive experiments on real-world data from Flickr suggest our proposed reranking method achieves promising results compared to existing reranking methods.
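The paper's model is a regression-based hypergraph, but the effect of absorbing nodes can be illustrated on a plain similarity graph. The sketch below, in the spirit of absorbing-random-walk re-ranking, freezes each selected item as an absorbing state so that items redundant with it stop accumulating expected visits; every detail here is a simplifying assumption rather than the authors' formulation.

```python
import numpy as np

def absorbing_rerank(S, k, lam=0.9):
    """Pick k items; already-ranked items become absorbing states of the walk."""
    n = S.shape[0]
    P = S / S.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
    P = lam * P + (1.0 - lam) / n                 # teleport term keeps it ergodic

    # Seed with the item of highest stationary probability (most central).
    evals, evecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    ranked = [int(np.argmax(pi / pi.sum()))]

    while len(ranked) < k:
        rest = [i for i in range(n) if i not in ranked]
        Q = P[np.ix_(rest, rest)]                 # walk restricted to unranked items
        N = np.linalg.inv(np.eye(len(rest)) - Q)  # expected visits before absorption
        ranked.append(rest[int(np.argmax(N.mean(axis=0)))])
    return ranked
```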
... Different approaches have been proposed to exploit the feedback for refining the parameters of the search: computing a new query vector [36], modifying the similarity measure in such a way that relevant images have a high similarity value [33], or trying to separate relevant and non-relevant images using pattern classification techniques such as Support Vector Machines [20], Decision Trees [24], Clustering [9], Random Forests [6], or Convolutional Neural Networks [48]. Our approach processes the user's feedback and separates relevant from non-relevant images by exploiting a CNN pre-trained on a large data set as in [31,49]. ...
Article
Full-text available
Given the great success of Convolutional Neural Networks (CNNs) for image representation and classification tasks, we argue that Content-Based Image Retrieval (CBIR) systems could also leverage CNN capabilities, mainly when Relevance Feedback (RF) mechanisms are employed: on the one hand, to improve the performance of CBIR systems, which is strictly related to the effectiveness of the descriptors used to represent an image, as they aim at providing the user with images similar to an initial query image; on the other hand, to reduce the semantic gap between the similarity perceived by the user and the similarity computed by the machine, by exploiting an RF mechanism where the user labels the returned images as being relevant or not with respect to her interests. Consequently, in this work, we propose a CBIR system based on transfer learning from a CNN trained on a vast image database, thus exploiting the generic image representation that it has already learned. The pre-trained CNN is then fine-tuned using the RF supplied by the user to reduce the semantic gap. In particular, after the user's feedback, we propose to tune and re-train the CNN according to the labelled set of relevant and non-relevant images. We then suggest different strategies to exploit the updated CNN for returning a novel set of images that are expected to be relevant to the user's needs. Experimental results on different data sets show the effectiveness of the proposed mechanisms in improving the representation power of the CNN with respect to the user's concept of image similarity. Moreover, the pros and cons of the different approaches can be clearly pointed out, thus providing clear guidelines for implementation in production environments.
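A minimal PyTorch sketch of the general idea: fine-tune a pre-trained CNN on the relevant/non-relevant images labelled in a feedback round, then re-score the database with the updated network. The backbone, the freezing policy, and the hyper-parameters are illustrative assumptions; the paper explores several re-training and retrieval strategies.

```python
import torch
import torch.nn as nn
from torchvision import models

def finetune_on_feedback(images, labels, epochs=5):
    """Fine-tune a pre-trained CNN head on one round of relevance feedback.

    images: (n, 3, 224, 224) float tensor of the judged images
    labels: (n,) long tensor, 1 = relevant, 0 = non-relevant
    """
    net = models.resnet18(weights="IMAGENET1K_V1")
    for p in net.parameters():                 # freeze the generic representation
        p.requires_grad = False
    net.fc = nn.Linear(net.fc.in_features, 2)  # new head for the two RF labels
    opt = torch.optim.Adam(net.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    net.train()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(images), labels)
        loss.backward()
        opt.step()
    return net   # re-score the database, e.g. with softmax(net(x))[1]
```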
... Several embedding approaches for transferring knowledge between the target and source modalities were proposed by them. Dang-Nguyen et al. [60] proposed a novel framework that can produce a visual description of a tourist attraction by choosing the most diverse pictures from community-contributed datasets to describe the queried location more comprehensively. Based on multi-graph enabled active learning, Wang et al. [61] presented a multi-modal web image retrieval technique to leverage the heterogeneous data on the web to improve retrieval precision. ...
Article
Full-text available
Due to the rapid development of mobile Internet techniques, such as online social networking and location-based services, massive amounts of multimedia data with geographical information are generated and uploaded to the Internet. In this paper, we propose a novel type of cross-modal multimedia retrieval, called geo-multimedia cross-modal retrieval, which aims to find a set of geo-multimedia objects according to geographical distance proximity and semantic concept similarity. Previous studies on cross-modal retrieval and spatial keyword search cannot address this problem effectively because they do not consider multimedia data with geo-tags (geo-multimedia). First, we present the definition of the kNN geo-multimedia cross-modal query and introduce relevant concepts such as spatial distance and semantic similarity measurement. As the key notion of this work, the cross-modal semantic representation space is formulated for the first time. A novel framework for geo-multimedia cross-modal retrieval is proposed, which includes multi-modal feature extraction, cross-modal semantic space mapping, geo-multimedia spatial indexing, and cross-modal semantic similarity measurement. To bridge the semantic gap between different modalities, we also propose a method named cross-modal semantic matching (CoSMat for short), which contains two important components, CorrProj and LogsTran, and aims to build a common semantic representation space for cross-modal semantic similarity measurement. In addition, to implement semantic similarity measurement, we employ a deep learning based method to learn multi-modal features that contain more high-level semantic information. Moreover, a novel hybrid index, the GMR-Tree, is carefully designed, which combines signatures of semantic representations and the R-Tree. An efficient GMR-Tree based kNN search algorithm called kGMCMS is developed. Comprehensive experimental evaluations on real and synthetic datasets clearly demonstrate that our approach outperforms state-of-the-art methods.
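A minimal sketch of the kind of ranking a kNN geo-multimedia cross-modal query implies: a convex combination of normalized spatial proximity and semantic similarity in a common representation space. The weighting `alpha`, the normalizations, and the brute-force scan (the paper uses the GMR-Tree index instead) are illustrative assumptions.

```python
import numpy as np

def geo_semantic_knn(query_xy, query_emb, obj_xy, obj_emb, k=10, alpha=0.5):
    """Return the indices of the top-k objects by combined geo/semantic score."""
    d = np.linalg.norm(obj_xy - query_xy, axis=1)            # geographic distance
    spatial = 1.0 - d / (d.max() + 1e-9)                     # 1 = nearest object
    semantic = obj_emb @ query_emb / (                       # cosine similarity in
        np.linalg.norm(obj_emb, axis=1)                      # the common semantic
        * np.linalg.norm(query_emb) + 1e-9)                  # representation space
    return np.argsort(-(alpha * spatial + (1 - alpha) * semantic))[:k]
```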
... Considering the increasing interest of participants in ImageCLEFPhoto, the creation of the new collection was seen as a big achievement in that it provides a more realistic framework for the analysis of diversity and the evaluation of retrieval systems aimed at promoting diverse results. The findings from this new collection were promising, and we plan to make use of other diversity algorithms (Dang-Nguyen et al, 2017a) in the future to enable more thorough evaluation. Finally, from the 2012 (Zellhöfer, 2012) and 2013 (Zellhöfer, 2013) tasks, the following insights were gained: ...
Chapter
Image retrieval was, and still is, a hot topic in research. It comes with many challenges that changed over the years with the emergence of more advanced methods for analysis and the enormous growth of images created, shared and consumed. This chapter gives an overview of domain-specific image retrieval evaluation approaches, which were part of the ImageCLEF evaluation campaign. Specifically, the robot vision, photo retrieval, scalable image annotation and lifelogging tasks are presented. The ImageCLEF medical activity is described in a separate chapter in this volume. Some of the presented tasks have been available for several years, whereas others are quite new (like lifelogging). This mix of new and old topics has been chosen to give the reader an idea about the development and trends within image retrieval. For each of the tasks, the datasets, participants, techniques used and lessons learned are presented and discussed, leading to a comprehensive summary.
... Most image diversification studies have focused on visual features. As notable examples, Dang-Nguyen et al. [13] performed clustering based on both textual descriptions and visual contents and Boato et al. [2] considered visual saliency as a diversification processing element. In addition, the graph clustering model was also applied to obtain diversified results by utilizing word-to-image correlation [47] . ...
Article
User requirements for result diversification in image retrieval have been increasing with the explosion of image resources. Result diversification requires that image retrieval systems be capable of handling semantic gaps between image visual features and semantic concepts, and of providing both relevant and diversified image results. Context information, such as captions, descriptions, and tags, provides opportunities for image retrieval systems to improve their result diversification. This study explores a mechanism for improving result diversification using the semantic distance of image social tags. We design and compare nine strategies that combine three different semantic distance algorithms (WordNet, Google Distance, and Explicit Semantic Analysis) with three re-ranking algorithms (MMR, xQuAD, and Score Difference) for result diversification. To better establish the effectiveness of applying semantic information, we also run a result diversification experiment using the visual features of images and compare the outcomes. Our data for experimentation were extracted from 269,648 images selected from the NUS-WIDE datasets with manually annotated subtopics. Experimental results affirm the effectiveness of applying semantic information for improving result diversification in image retrieval. In particular, WordNet-based semantic distance combined with Score Difference (WordNet-DivScore) outperformed the other strategies in diversifying image retrieval results.
... Several embedding approaches for transferring knowledge between the target and source modalities were proposed by them. Dang-Nguyen et al. [14] proposed a novel framework that can produce a visual description of a tourist attraction by choosing the most diverse pictures from community-contributed datasets to describe the queried location more comprehensively. This approach can filter out non-relevant images and obtain a reliable set of diverse and relevant images by first clustering similar images according to their textual descriptions and their visual content. ...
Preprint
Due to the rapid development of mobile Internet techniques, cloud computation, and the popularity of online social networking and location-based services, massive amounts of multimedia data with geographical information are generated and uploaded to the Internet. In this paper, we propose a novel type of cross-modal multimedia retrieval called geo-multimedia cross-modal retrieval, which aims to search out a set of geo-multimedia objects based on geographical distance proximity and semantic similarity between different modalities. Previous studies on cross-modal retrieval and spatial keyword search cannot address this problem effectively because they do not consider multimedia data with geo-tags and do not focus on this type of query. To address this problem efficiently, we present the definition of the kNN geo-multimedia cross-modal query for the first time and introduce relevant concepts such as the cross-modal semantic representation space. To bridge the semantic gap between different modalities, we propose a method named cross-modal semantic matching, which contains two important components, CorrProj and LogsTran, and aims to construct a common semantic representation space for cross-modal semantic similarity measurement. Besides, we designed a framework based on deep learning techniques to implement the construction of the common semantic representation space. In addition, a novel hybrid indexing structure named GMR-Tree, combining geo-multimedia data and the R-Tree, is presented, and an efficient kNN search algorithm called kGMCMS is designed. Comprehensive experimental evaluation on real and synthetic datasets clearly demonstrates that our solution outperforms state-of-the-art methods.
... In this task, information about a place is very important, so it is better to retrieve images that are as diverse as possible. Duc-Tien et al. introduced a method to deal with this problem using a dataset collected on Flickr [4,3]. The method improved precision up to 75%. ...
Conference Paper
Lifelog data provide potential insights for analyzing and understanding people in their daily activities. However, it is still a challenging problem to index lifelog data efficiently and to provide a user-friendly interface that supports users in retrieving moments of interest. This motivates our proposed system to retrieve lifelog moments based on visual concept fusion and text-based query expansion. We first extract visual concepts, including entities, actions, and places, from images. Besides NeuralTalk, we also propose a novel method using concept-encoded feature augmentation to generate text descriptions that exploit further semantics of images. Our proposed lifelog retrieval system allows a user to search for lifelog moments with four different types of queries on place, time, entity, and extra biometric data. Furthermore, the key feature of our proposed system is to automatically suggest concepts related to the input query concepts, to efficiently assist a user in expanding a query. Experimental results on the lifelog moment retrieval dataset of ImageCLEF 2018 demonstrate the potential of our method and system for retrieving lifelog moments.
... Ranking. To refine the results, i.e., to increase the precision of the top retrieved images, we use a hierarchical agglomerative clustering algorithm (see [11]) to group similar images into the same cluster based on all of their features. The clusters are then sorted based on the number of images, in decreasing rank order. ...
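A minimal sketch of the ranking step just described, assuming precomputed feature vectors and an arbitrary cluster count: group the images by hierarchical agglomerative clustering, then emit clusters in decreasing order of cardinality.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def rank_by_cluster_cardinality(features, n_clusters=20):
    """Order images so that members of the largest clusters come first."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(features)
    order = []
    # Bigger clusters first: widely photographed content is more likely relevant.
    for c in sorted(np.unique(labels), key=lambda c: -(labels == c).sum()):
        order.extend(np.where(labels == c)[0].tolist())
    return order
```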
Conference Paper
Full-text available
Nowadays, almost everyone holds some form or other of a personal life archive. Automatically maintaining such an archive is becoming increasingly common; however, without automatic support, users will quickly be overwhelmed by the volume of data and will miss out on the potential benefits that lifelogs provide. In this paper we give an overview of the current status of lifelog research and propose a concept for exploring these archives. We motivate the need for new methodologies for indexing data, organizing content and supporting information access. Finally, we describe the challenges to be addressed and give an overview of the initial steps that have to be taken to address the challenges of organising and searching personal life archives.
Article
Before COVID-19, tourist destinations experienced congestion of both famous tourist attractions and public transportation. Over-tourism is not an issue at this time, but it is likely to re-emerge after the COVID-19 pandemic ends. One method of mitigating over-tourism is to estimate tourist behavior using a tourist transition model and consequently adjust public transportation operations. In this study, we propose a construction method for a model of tourist transitions among tourist attractions based on tourist GPS trajectory data. We construct tourist transition models using actual trajectory data for tourists staying in the vicinity of Kyoto City. The results verify the model's performance.
Article
The field of visual search has gained significant attention recently, particularly in the context of web search engines and e-commerce product search platforms. However, the abundance of web images presents a challenge for modern image retrieval systems, as they need to find both relevant and diverse images that maximize users' satisfaction. In response to this challenge, we propose a non-dominated visual diversity re-ranking (NDVDR) method based on the concept of Pareto optimality. To begin with, we employ a fast binary hashing method as a coarse-grained retrieval procedure, which allows us to efficiently obtain a subset of candidate images for subsequent re-ranking. Fed with these initial retrieved results, the NDVDR performs a fine-grained re-ranking procedure that boosts both relevance and visual diversity among the top-ranked images. Recognizing the inherently conflicting nature of the relevance and diversity objectives, the re-ranking procedure is cast as the analytical stage of a multi-criteria decision-making process, seeking the optimal trade-off between the two conflicting objectives within the initial retrieved images. In particular, a non-dominated sorting mechanism is devised that produces Pareto non-dominated hierarchies among images based on the Pareto dominance relation. Additionally, two novel measures are introduced for effectively characterizing the relevance and diversity scores of different images. We conduct experiments on three popular real-world image datasets and compare our re-ranking method with several state-of-the-art image search re-ranking methods. The experimental results validate that our re-ranking approach guarantees retrieval accuracy while simultaneously boosting diversity among the top-ranked images.
Article
3D model retrieval has been widely utilized in numerous domains, such as computer-aided design, digital entertainment, and virtual reality. Recently, many graph-based methods have been proposed to address this task by using multi-view information of 3D models. However, these methods are always constrained by many-to-many graph matching for the similarity measure between pairwise models. In this article, we propose a multi-view graph matching method (MVGM) for 3D model retrieval. The proposed method can decompose the complicated multi-view graph-based similarity measure into multiple single-view graph-based similarity measures and fusion. First, we present the method for single-view graph generation, and we further propose the novel method for the similarity measure in a single-view graph by leveraging both node-wise context and model-wise context. Then, we propose multi-view fusion with diffusion, which can collaboratively integrate multiple single-view similarities w.r.t. different viewpoints and adaptively learn their weights, to compute the multi-view similarity between pairwise models. In this way, the proposed method can avoid the difficulty in the definition and computation of the traditional high-order graph. Moreover, this method is unsupervised and does not require a large-scale 3D dataset for model learning. We conduct evaluations on four popular and challenging datasets. The extensive experiments demonstrate the superiority and effectiveness of the proposed method compared against the state of the art. In particular, this unsupervised method can achieve competitive performances against the most recent supervised and deep learning method.
Preprint
In the search and retrieval of multimedia objects, it is impractical to extract the contents for indexing either manually or automatically, since most multimedia contents are not machine-extractable, while manual extraction is highly laborious and time-consuming. However, by systematically capturing and analyzing the feedback patterns of human users, vital information concerning the multimedia contents can be harvested for effective indexing and subsequent search. By learning from the human judgment and mental evaluation of users, effective search indices can be gradually developed and built up, and subsequently exploited to find the most relevant multimedia objects. To avoid hovering around a local maximum, we apply the epsilon-greedy method to systematically explore the search space. Through such methodic exploration, we show that the proposed approach is able to guarantee that the most relevant objects can always be discovered, even if initially they were overlooked or not regarded as relevant. The search behavior of the present approach is quantitatively analyzed, and closed-form expressions are obtained for the performance of two variants of the epsilon-greedy algorithm, namely EGSE-A and EGSE-B. Simulations and experiments on a real data set have been performed and show good agreement with the theoretical findings. The present method is able to leverage exploration in an effective way to significantly raise the performance of multimedia information search, and enables the certain discovery of relevant objects which may otherwise be undiscoverable.
Conference Paper
Full-text available
Multimedia search is an emerging area in information retrieval (IR) and recommender systems (RS) research. However, there is a lack of standardized audiovisual datasets that include rich content descriptors, which are a necessity in content-based IR and RS. The contributions of this paper are twofold: First, we present a new multimedia dataset of movie clips, named MFVCD-7K Multifaceted Video Clip Dataset, that comes with low-level and semantic multi-modal descriptions of their content (textual, audio, and visual). In addition, we showcase the use of this dataset for a novel content-based video clip retrieval and result diversification task we introduce. We investigate baseline algorithms for retrieval and diversification, and provide experimental results according to relevance and diversity measures. We believe that both dataset and baseline results constitute an important asset for the IR, RS, and multimedia communities.
Article
Today, diversifying the retrieval results of a query improves customers' search efficiency. Showing multiple aspects of information provides users with an overview of the object, which helps them quickly target their demands. To discover aspects, research focuses on generating image clusters from the initially retrieved results. As an effective approach, latent Dirichlet allocation (LDA) has been shown to perform well at discovering high-level topics. However, traditional LDA is designed to process textual words and needs discrete data as input. When we apply this algorithm to continuous visual images, a common solution is to quantize the continuous features into discrete form with a bag-of-visual-words algorithm, a process whose quantization error inevitably loses information. To construct a topic model with complete visual information, this work applies Gaussian latent Dirichlet allocation (GLDA) to the diversity issue of image retrieval. In this model, the traditional multinomial distribution is substituted with a Gaussian distribution to model continuous visual features. In addition, we propose a two-phase spectral clustering strategy, called dual spectral clustering, to generate clusters from the region level to the image level. Experiments on the challenging landmarks of the DIV400 database show that our proposal improves relevance and diversity by about 10% compared to traditional topic models.
Technical Report
Full-text available
This paper describes the contributions of Vienna University of Technology (TUW) to the MediaEval 2015 Retrieving Diverse Social Images challenge. Our approach consists of three phases: (1) a precision-oriented phase, in which we focus only on the relevance of the documents; (2) a recall-oriented phase, in which we focus only on the diversity aspect; (3) a merging phase, in which we explore ways to find a balance between the relevance and diversity factors. We use two fusion methods for this last part. Our best run reached an F1@20 of 0.582.
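The reported metric combines the task's two official measures: F1@20 is the harmonic mean of precision at 20 (relevance) and cluster recall at 20 (diversity, i.e., the fraction of ground-truth clusters represented among the relevant images in the top 20). A minimal sketch:

```python
def f1_at_k(retrieved, relevant, cluster_of, n_clusters, k=20):
    """Harmonic mean of precision@k and cluster recall@k (the task's F1@20)."""
    top = retrieved[:k]
    p = sum(1 for i in top if i in relevant) / k                          # relevance
    cr = len({cluster_of[i] for i in top if i in relevant}) / n_clusters  # diversity
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```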
Article
Full-text available
An efficient information retrieval system should be able to provide search results that are at the same time relevant to the query and diverse, i.e., covering its different aspects. In this paper we address the issue of image search result diversification. We propose a new hybrid approach that integrates both the automation power of machines and the intelligence of human observers via an optimized multi-class Support Vector Machine (SVM) classifier-based relevance feedback (RF). In contrast to existing RF techniques, which focus almost exclusively on improving the relevance of the results, the novelty of our approach is in prioritizing diversification. We designed several diversification strategies which operate on top of the SVM RF and exploit the classifiers' output confidence scores. Experimental validation conducted on a publicly available image retrieval diversification dataset shows the benefits of this approach, which outperforms other state-of-the-art methods.
Article
Full-text available
Label propagation consists in annotating an unlabeled dataset starting from a set of labeled items. However, most current methods exploit only image similarity between labeled and unlabeled images in order to find propagation candidates, which may result, especially in very large datasets, in retrieving mostly near-duplicate images. While such approaches are technically correct, as they maximize the propagation precision, the resulting annotated dataset may not be as useful, since they lack intra-class variability within the set of images sharing the same label. In this paper, we propose an approach for label propagation which favors the propagation of an object’s label to a set of images representing as many different views of that object as possible, while at the same time preserving the relevance of the retrieved items to the query. Our method is based on a diversity-based clustering technique using a random forest framework and a label propagation approach which is able to effectively and efficiently propagate annotations using a similarity-based approach operating on clusters. The method was tested on a very large dataset of fish images achieving good performance in automated label propagation, ensuring diversification of the annotated items while preserving precision.
Article
Full-text available
This article addresses the issue of social image search result diversification. We propose a novel perspective on the diversification problem via Relevance Feedback (RF). Traditional RF introduces the user into the processing loop by harvesting feedback about the relevance of the search results; this information is used to recompute a better representation of the data. The novelty of our work is in exploiting this concept in a completely automated manner via pseudo-relevance, while prioritizing the diversification of the results over relevance. User feedback is simulated automatically by selecting positive and negative examples, with regard to relevance, from the initial query results. Unsupervised hierarchical clustering is used to re-group images according to their content. Diversification is finally achieved with a re-ranking approach. Experimental validation on Flickr data shows the advantages of this approach.
Article
Full-text available
Existing image retrieval systems exploit textual and/or visual information to return results. Retrieval is mostly focused on the data themselves and disregards the data sources. In Web 2.0 platforms, the quality of annotations provided by different users can vary strongly. To account for this variability, we complement existing methods by introducing user tagging credibility into the retrieval process. Tagging credibility is automatically estimated by leveraging a large set of visual concept classifiers learned with OverFeat, a convolutional neural network (CNN) feature. A good image retrieval system should return results that are both relevant and diversified, and here we tackle both challenges. Classically, we diversify results by using a k-Means algorithm and increase relevance by favoring images uploaded by users with good credibility estimates. Evaluation is performed on DIV400, a publicly available social image retrieval dataset, and shows that our method is competitive with existing approaches.
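A minimal sketch of the diversification step described above: cluster the candidate results with k-Means and pick from each cluster the image with the best uploader credibility estimate. Here `credibility` is assumed to be a precomputed per-image score; its estimation from CNN concept classifiers is outside the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def diversify(features, credibility, k=20):
    """One image per k-Means cluster, preferring credible uploaders."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    picks = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx):                                  # skip empty clusters
            picks.append(int(idx[np.argmax(credibility[idx])]))
    return picks
```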
Conference Paper
Full-text available
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2014 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification in the context of social photo retrieval. We present the task challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.
Conference Paper
Full-text available
In this paper we address the issue of relevance feedback in the context of content-based image retrieval. We propose a method that uses a hierarchical cluster representation of the relevant and non-relevant images in a query. The main advantage of this strategy is that it operates on the initial set of retrieved images (user feedback is provided only once, for a small number of retrieved images) instead of performing additional queries as most approaches do. Experimental tests conducted on several standard image databases and using state-of-the-art content descriptors (e.g. MPEG-7, SURF) show that the proposed method provides a significant improvement in retrieval performance, outperforming some other classic approaches.
Conference Paper
Full-text available
We present EXTENT, an image annotation system that combines context and content information to annotate images with metadata that cannot be reliably inferred from either the context or the content alone. EXTENT first applies contextual information to restrict the search scope in an image database and reduce the complexity of the ensuing content analysis. It can then afford to use more expensive (hence more robust) algorithms for performing content analysis within the restricted database scope. Our experiments show that effectively combining content with context information can infer metadata with high accuracy.
Article
Full-text available
This paper presents a novel graph-based framework which can combine context, consistency, and diversity cues for interactive image categorization. The image representation is first formed with visual keywords by dividing images into blocks and then performing clustering on these blocks. The context across visual keywords within an image is further captured by proposing a 2-D spatial Markov chain model. To develop a graph-based approach to image categorization, we incorporate intra-image context into a new class of kernel called spatial Markov kernel which can be used to define the affinity matrix for a graph. After graph construction with this kernel, the large unlabeled data can be exploited by graph-based semi-supervised learning through label propagation with inter-image consistency. For interactive image categorization, we further combine this semi-supervised learning with active learning by defining a new diversity-based data selection criterion using spectral embedding. Experiments then demonstrate that the proposed framework can achieve promising results.
Conference Paper
Full-text available
A novel method of relevance feedback is presented, based on support vector machine learning, in a content-based image retrieval system. An SVM classifier can be learned from training data of relevant and irrelevant images marked by users. Using the classifier, the system can efficiently retrieve more images relevant to the query from the database. Experiments were carried out on a large database of 9,918 images. They show that the interactive learning and retrieval process finds more correct images with each iteration. They also show the generalization ability of SVM under the condition of limited training samples.
Conference Paper
Full-text available
High retrieval precision in content-based image retrieval can be attained by adopting relevance feedback mechanisms. In this paper we propose a weighted similarity measure based on the nearest-neighbor relevance feedback technique previously proposed by the authors. Each image is ranked according to a relevance score depending on its nearest-neighbor distances from relevant and non-relevant images. Distances are computed by a weighted measure, the weights being related to the capability of the feature spaces to represent relevant images as nearest neighbors. This approach is used to weight individual features and feature subsets, and also to weight relevance scores computed from different feature spaces. Reported results show that the proposed weighting scheme improves performance with respect to unweighted distances and to other weighting schemes.
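A hedged reconstruction of the kind of score the abstract describes (the paper's exact weighting may differ): writing NN_R(I) and NN_N(I) for the nearest neighbours of image I among the relevant and non-relevant images, and w_f for a per-feature weight,

```latex
\mathrm{rel}(I) = \frac{d_w\bigl(I, \mathrm{NN}_N(I)\bigr)}
                       {d_w\bigl(I, \mathrm{NN}_R(I)\bigr) + d_w\bigl(I, \mathrm{NN}_N(I)\bigr)},
\qquad
d_w(x, y) = \sqrt{\sum\nolimits_f w_f \,(x_f - y_f)^2}
```

so an image scores close to 1 when its weighted distance to the nearest relevant image is much smaller than its distance to the nearest non-relevant one.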
Conference Paper
Full-text available
The ImageCLEF Photo Retrieval Task 2009 focused on image retrieval and diversity. A new collection was utilised in this task consisting of approximately half a million images with English annotations. Queries were based on analysing search query logs and two different types were released: one containing information about image clusters; the other without. A total of 19 participants submitted 84 runs. Evaluation, based on Precision at rank 10 and Cluster Recall at rank 10, showed that participants were able to generate runs of high diversity and relevance. Findings show that submissions based on using mixed modalities performed best compared to those using only concept-based or content-based retrieval methods. The selection of query fields was also shown to affect retrieval performance. Submissions not using the cluster information performed worse with respect to diversity than those using this information. This paper summarises the ImageCLEFPhoto task for 2009.
Conference Paper
Full-text available
Due to the reliance on the textual information associated with an image, image search engines on the Web lack the discriminative power to deliver visually diverse search results. The textual descriptions are key to retrieve relevant results for a given user query, but at the same time provide little information about the rich image content. In this paper we investigate three methods for visual diversification of image search results. The methods deploy lightweight clustering techniques in combination with a dynamic weighting function of the visual features, to best capture the discriminative aspects of the resulting set of images that is retrieved. A representative image is selected from each cluster, which together form a diverse result set. Based on a performance evaluation we find that the outcome of the methods closely resembles human perception of diversity, which was established in an extensive clustering experiment carried out by human assessors.
Article
Full-text available
This paper focuses on developing a Fast And Semantics-Tailored (FAST) image retrieval methodology. Specifically, the contributions of FAST methodology to the CBIR literature include: (1) development of a new indexing method based on fuzzy logic to incorporate color, texture, and shape information into a region-based approach to improving the retrieval effectiveness and robustness; (2) development of a new hierarchical indexing structure and the corresponding hierarchical, elimination-based A* retrieval (HEAR) algorithm to significantly improve the retrieval efficiency without sacrificing the retrieval effectiveness; it is shown that HEAR is guaranteed to deliver a logarithmic search in the average case; (3) employment of user relevance feedback to tailor the effective retrieval to each user's individualized query preference through the novel indexing tree pruning (ITP) and adaptive region weight updating (ARWU) algorithms. Theoretical analysis and experimental evaluations show that FAST methodology holds great promise in delivering fast and semantics-tailored image retrieval in CBIR.
Article
Full-text available
Color names are required in real-world applications such as image retrieval and image annotation. Traditionally, they are learned from a collection of labeled color chips. These color chips are labeled with color names within a well-defined experimental setup by human test subjects. However, naming colors in real-world images differs significantly from this experimental setting. In this paper, we investigate how color names learned from color chips compare to color names learned from real-world images. To avoid hand labeling real-world images with color names, we use Google Image to collect a data set. Due to the limitations of Google Image, this data set contains a substantial quantity of wrongly labeled data. We propose several variants of the PLSA model to learn color names from this noisy data. Experimental results show that color names learned from real-world images significantly outperform color names learned from labeled color chips for both image retrieval and image annotation.
Article
Full-text available
Development of content-based image retrieval (CBIR) techniques has suffered from the lack of standardized ways for describing visual image content. Luckily, the MPEG-7 international standard is now emerging as both a general framework for content description and a collection of specific agreed-upon content descriptors. We have developed a neural, self-organizing technique for CBIR. Our system is named PicSOM and it is based on pictorial examples and relevance feedback (RF). The name stems from "picture" and the self-organizing map (SOM). The PicSOM system is implemented by using tree structured SOMs. In this paper, we apply the visual content descriptors provided by MPEG-7 in the PicSOM system and compare our own image indexing technique with a reference system based on vector quantization (VQ). The results of our experiments show that the MPEG-7-defined content descriptors can be used as such in the PicSOM system even though Euclidean distance calculation, inherently used in the PicSOM system, is not optimal for all of them. Also, the results indicate that the PicSOM technique is a bit slower than the reference system in starting to find relevant images. However, when the strong RF mechanism of PicSOM begins to function, its retrieval precision exceeds that of the reference system.
Article
Full-text available
This paper presents an overview of the color and texture descriptors that have been approved for the Final Committee Draft of the MPEG-7 standard. The color and texture descriptors described in this paper have undergone extensive evaluation and development during the past two years. Evaluation criteria include the effectiveness of the descriptors in similarity retrieval, as well as extraction, storage, and representation complexities. The color descriptors in the standard include a histogram descriptor that is coded using the Haar transform, a color structure histogram, a dominant color descriptor, and a color layout descriptor. The three texture descriptors include one that characterizes homogeneous texture regions and another that represents the local edge distribution. A compact descriptor that facilitates texture browsing is also defined. Each of the descriptors is explained in detail by its semantics, extraction, and usage. The effectiveness is documented by experimental results.
Article
Full-text available
This paper presents a method for combining query relevance with information novelty in the context of text retrieval and summarization. The Maximal Marginal Relevance (MMR) criterion strives to reduce redundancy while maintaining query relevance in re-ranking retrieved documents and in selecting appropriate passages for text summarization. Preliminary results indicate some benefits for MMR diversity ranking in document retrieval and in single-document summarization. The latter are borne out by the recent results of the SUMMAC conference on the evaluation of summarization systems. However, the clearest advantage is demonstrated in constructing non-redundant multi-document summaries, where MMR results are clearly superior to non-MMR passage selection.
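The MMR criterion itself is compact enough to state: with Q the query, R the retrieved document set, S the subset of R already selected, and lambda in [0, 1] trading relevance against novelty,

```latex
\mathrm{MMR} \;=\; \operatorname*{arg\,max}_{D_i \in R \setminus S}
\Bigl[ \lambda \,\mathrm{Sim}_1(D_i, Q)
       \;-\; (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Bigr]
```

At lambda = 1 this reduces to a standard relevance ranking, while smaller values of lambda increasingly penalize candidates similar to documents already chosen.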
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM-based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
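A minimal sketch of extracting a HOG descriptor with scikit-image, using the configuration the paper reports as best (9 orientation bins, 8x8-pixel cells, 2x2-cell overlapping blocks with local contrast normalization); the sliding-window linear-SVM detector built on top of it is omitted, and the sample image is just a stand-in for a detection window.

```python
from skimage import color, data
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())     # any grayscale detection window
descriptor = hog(image,
                 orientations=9,             # fine orientation binning
                 pixels_per_cell=(8, 8),     # fine-scale spatial cells
                 cells_per_block=(2, 2),     # overlapping normalization blocks
                 block_norm="L2-Hys")        # local contrast normalization
print(descriptor.shape)  # one long vector, fed to a linear SVM in the paper
```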
Article
An ever-increasing part of communication between persons involves the use of pictures, due to the wide availability of powerful smartphone cameras and of cheap storage space. The rising popularity of social networking applications such as Facebook, Twitter, and Instagram, and of instant messaging applications such as WhatsApp and WeChat, is clear evidence of this phenomenon, owing to the opportunity of sharing in real time a pictorial representation of the context each individual is living in. The media rapidly exploited this phenomenon, using the same channel either to publish their reports or to gather additional information on an event through the community of users. While the real-time use of images is managed through metadata associated with the image (i.e., the timestamp, the geolocation, tags, etc.), their retrieval from an archive can be far from trivial, as an image bears a rich semantic content that goes beyond the description provided by its metadata. It turns out that, after more than 20 years of research on Content-Based Image Retrieval (CBIR), the giant increase in the number and variety of images available in digital format is challenging the research community. It is quite easy to see that any approach aiming to face such challenges must rely on different image representations that need to be conveniently fused in order to adapt to the subjectivity of image semantics. This paper offers a journey through the main information fusion ingredients that a recipe for the design of a CBIR system should include to meet the demanding needs of users.
Article
Social media sharing websites like Flickr allow users to annotate images with free tags, which significantly contributes to the development of web image retrieval and organization. Tag-based image search is an important method for finding images contributed by social users on such websites. However, making the top-ranked results both relevant and diverse is challenging. In this paper, we propose a social re-ranking system for tag-based image retrieval that considers both the relevance and the diversity of an image. We aim at re-ranking images according to their visual information, semantic information, and social clues. The initial results include images contributed by different social users, and each user usually contributes several images. First, we sort these images by inter-user re-ranking: users with a higher contribution to the given query rank higher. Then we sequentially apply intra-user re-ranking to each ranked user's image set, and only the most relevant image from each user's set is selected. These selected images compose the final retrieved results. We build an inverted index structure for the social image dataset to accelerate the search process. Experimental results on a Flickr dataset show that our social re-ranking method is effective and efficient.
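A minimal sketch of this two-stage selection, assuming the initial search returns (user, image, score) triples; using each user's best score as the "contribution" signal is a simplifying assumption, not the exact measure of the cited paper.

```python
from collections import defaultdict

def social_rerank(results):
    """Two-stage social re-ranking (sketch).

    results : list of (user_id, image_id, relevance_score) tuples
              from an initial tag-based search.
    Returns one image per user, with users ordered by their best
    score (a simple proxy for per-user contribution).
    """
    by_user = defaultdict(list)
    for user, image, score in results:
        by_user[user].append((score, image))

    # Inter-user re-ranking: order users by their strongest result.
    users = sorted(by_user, key=lambda u: max(by_user[u])[0], reverse=True)

    # Intra-user re-ranking: keep only each user's most relevant image,
    # which enforces diversity across contributors.
    return [max(by_user[u])[1] for u in users]
```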
Article
Sketch-based image retrieval often needs to optimize the trade-off between efficiency and precision. Index structures are typically applied to large-scale databases to enable efficient retrieval, but their performance can be affected by quantization errors. Moreover, the ambiguity of user-provided examples may also degrade performance compared with traditional image retrieval methods. Designing sketch-based image retrieval systems that preserve the index structure is therefore challenging. In this paper, we propose an effective sketch-based image retrieval approach with re-ranking and relevance feedback schemes. Our approach makes full use of the semantics in query sketches and the top-ranked images of the initial results. We also apply relevance feedback to find more images relevant to the input query sketch. The integration of the two schemes yields mutual benefits and improves the performance of sketch-based image retrieval.
Article
Landmark summarization with diverse viewpoints is very important in landmark retrieval, as it can create a comprehensive description of a landmark for users. In this paper, we present an approach for summarizing a collection of landmark images from diverse viewpoints. First, we group landmark images with content overlap by viewpoint album generation. Second, we model the relative viewpoint of each image within the viewpoint album based on the spatial layout of distinctive descriptors of a landmark. Third, we express the relative viewpoint of an image with a 4-dimensional viewpoint vector comprising horizontal position, vertical position, scale, and rotation. Finally, we summarize the landmarks in terms of viewpoints. Experimental results show the effectiveness of the proposed landmark summarization approach.
Article
Diversification of search results allows for better and faster search, gaining knowledge about different perspectives and viewpoints on retrieved information sources. Recently, various methods for diversification of image retrieval results have been proposed, mainly using textual information or techniques imported from the natural language processing domain. However, images contain much more information than their textual descriptions, and the use of visual features deserves special attention in this context. Visual saliency provides information about the parts of an image perceived as most important, which are instinctively targeted by humans when shooting a photo or looking at a picture. For this reason, we propose to exploit such information to improve the diversification of search results. To this purpose, we introduce a saliency-based method to re-rank the results of a query and show that it can achieve significantly better performance than the baseline approach. Experimental validation conducted on a number of queries applied to various datasets demonstrates the potential of saliency information for the diversification of image retrieval results.
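As a rough illustration of how saliency can enter the diversification pipeline, the sketch below builds saliency-weighted descriptors and a simple diversity score. The gray-level histogram descriptor is an assumption for illustration (the cited work does not prescribe it), and the per-pixel saliency map is taken as given; such descriptors can then drive a greedy MMR-style selection like the one sketched earlier.

```python
import numpy as np

def saliency_weighted_hist(image, saliency, bins=32):
    """Gray-level histogram weighted by a per-pixel saliency map in
    [0, 1]; salient regions dominate the descriptor, so two photos of
    the same subject against different backgrounds still match."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 255),
                           weights=saliency)
    return hist / (hist.sum() + 1e-12)

def set_diversity(descriptors):
    """Diversity of a result set: mean pairwise L1 distance between
    the saliency-weighted descriptors of its images."""
    n = len(descriptors)
    dists = [np.abs(descriptors[i] - descriptors[j]).sum()
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) if dists else 0.0
```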
Article
Landmark image classification is attracting increasing research attention due to its great importance in real applications, ranging from travel-guide recommendation to 3-D modelling and visualization of geolocations. While a large amount of effort has been invested, the problem remains unsolved by academia and industry. One of the key reasons is the large intra-class variance rooted in the diverse visual appearance of landmark images. Distinguished from most existing methods based on scalable image search, we approach the problem from a new perspective and model landmark classification as multi-modal categorization, which enjoys the advantages of low storage overhead and high classification efficiency. Toward this goal, a novel and effective feature representation, called the hierarchical multi-modal exemplar (HMME) feature, is proposed to characterize landmark images. To compute HMME, training images are first partitioned into regions with hierarchical grids to generate candidate images and regions. Then, at the exemplar-selection stage, hierarchical discriminative exemplars in multiple modalities are discovered automatically via iterative boosting and latent region-label mining. Finally, HMME is generated via region-based locality-constrained linear coding (RLLC), which effectively encodes the semantics of the discovered exemplars into HMME. Meanwhile, dimension reduction is applied to remove redundant information by projecting the raw HMME into a lower-dimensional space. The final HMME representation is discriminative and linearly separable. An experimental study has been carried out on real-world landmark datasets, and the results demonstrate the superior performance of the proposed approach over several state-of-the-art techniques.
Article
From social media have emerged continuous needs for automatic travel recommendations. Collaborative filtering (CF) is the most well-known approach. However, existing approaches generally suffer from various weaknesses. For example, sparsity can significantly degrade the performance of traditional CF: if a user has visited only very few locations, accurately identifying similar users becomes very challenging due to the lack of sufficient information for effective inference. Moreover, existing recommendation approaches often ignore rich user information, such as the textual descriptions of photos, which can reflect users' travel preferences. The topic model (TM) method is an effective way to address the sparsity problem, but is still far from satisfactory. In this paper, an author-topic-model-based collaborative filtering (ATCF) method is proposed to facilitate comprehensive points-of-interest (POI) recommendations for social users. In our approach, user preference topics, such as cultural, cityscape, or landmark, are extracted from the geo-tag-constrained textual descriptions of photos via the author topic model, instead of from the geo-tags (GPS locations) alone. The advantages and superior performance of our approach are demonstrated by extensive experiments on a large collection of data.
Article
Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing numbers of social events and the corresponding microblogs with multimedia contents, it is desirable to provide visualized summaries to help users to quickly grasp the essence of these social events for better understanding. While existing approaches mostly focus only on text-based summary, microblog summarization with multiple media types (e.g., text, image, and video) is scarcely explored. In this paper, we propose a multimedia social event summarization framework to automatically generate visualized summaries from the microblog stream of multiple media types. Specifically, the proposed framework comprises three stages, as follows. 1) A noise removal approach is first devised to eliminate potentially noisy images. An effective spectral filtering model is exploited to estimate the probability that an image is relevant to a given event. 2) A novel cross-media probabilistic model, termed Cross-Media-LDA (CMLDA), is proposed to jointly discover subevents from microblogs of multiple media types. The intrinsic correlations among these different media types are well explored and exploited for reinforcing the cross-media subevent discovery process. 3) Finally, based on the cross-media knowledge of all the discovered subevents, a multimedia microblog summary generation process is designed to jointly identify both representative textual and visual samples, which are further aggregated to form a holistic visualized summary. We conduct extensive experiments on two real-world microblog datasets to demonstrate the superiority of the proposed framework as compared to the state-of-the-art approaches.
Article
This paper proposes a new bag-of-visual phrase (BoP) approach for mobile landmark recognition based on discriminative learning of category-dependent visual phrases. Many previous landmark recognition works adopt a bag-of-words (BoW) method which ignores the co-occurrence relationship between neighboring visual words in an image. Although some works that focus on visual phrase learning have appeared, they mainly construct a generalized phrase dictionary from all categories for recognition, which lacks descriptive capability for a specific category. Another shortcoming of these works is the hard assignment of numerous feature sets to a limited number of phrases, which causes some useful feature sets to be discarded, and yields information loss. In view of this, this paper presents a discriminative soft BoP approach for mobile landmark recognition. The candidate phrases defined as adjacent pairwise codewords are first generated for each category. The important candidates are then selected through a proposed discriminative visual phrase (DVP) selection approach to form the BoP dictionary. Finally, a soft encoding method is developed to quantize each image into a BoP histogram. The context information such as location and direction captured by mobile devices is also integrated with the proposed BoP-based content analysis for landmark recognition. Experimental results on two datasets show that the proposed method is effective in mobile landmark recognition.
Article
In this paper, we present a novel approach for automatic visual summarization of a geographic area that exploits user-contributed images and related explicit and implicit metadata collected from popular content-sharing websites. By means of this approach, we search for a limited number of representative but diverse images to represent the area within a certain radius around a specific location. Our approach is based on the random walk with restarts over a graph that models relations between images, visual features extracted from them, associated text, as well as the information on the uploader and commentators. In addition to introducing a novel edge weighting mechanism, we propose in this paper a simple but effective scheme for selecting the most representative and diverse set of images based on the information derived from the graph. We also present a novel evaluation protocol, which does not require input of human annotators, but only exploits the geographical coordinates accompanying the images in order to reflect conditions on image sets that must necessarily be fulfilled in order for users to find them representative and diverse. Experiments performed on a collection of Flickr images, captured around 207 locations in Paris, demonstrate the effectiveness of our approach.
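The core of this approach is the random walk with restarts; a minimal numpy sketch follows, assuming the multimodal graph (images, visual features, text, uploaders, commentators) has already been assembled into a non-negative adjacency matrix. The restart probability and convergence tolerance are illustrative choices.

```python
import numpy as np

def random_walk_with_restart(W, restart_idx, alpha=0.15, tol=1e-8):
    """Stationary relevance scores via random walk with restarts.

    W           : (n, n) non-negative adjacency matrix of the graph
    restart_idx : node index (or list of indices) the walker jumps
                  back to, e.g. the node(s) tied to the query location
    alpha       : restart probability
    """
    n = W.shape[0]
    P = W / (W.sum(axis=0, keepdims=True) + 1e-12)  # column-stochastic
    e = np.zeros(n)
    e[restart_idx] = 1.0 / np.size(restart_idx)     # restart vector
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - alpha) * P @ r + alpha * e     # power iteration
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
```

Nodes with high stationary probability are strongly connected to the query through any combination of visual, textual, and social edges, which is what makes them candidates for the representative set.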
Article
We are living in an Age of Information where the amount of accessible data from science and culture is almost limitless. However, this also means that finding an item of interest is increasingly difficult, a digital needle in the proverbial haystack. In this article, we focus on the topic of content-based image retrieval using interactive search techniques, i.e., how does one interactively find any kind of imagery from any source, regardless of whether it is photographic, MRI or X-ray? We highlight trends and ideas from over 170 recent research papers aiming to capture the wide spectrum of paradigms and methods in interactive search, including its subarea relevance feedback. Furthermore, we identify promising research directions and several grand challenges for the future.
Article
Cluster analysis is a collective term covering a wide variety of techniques for delineating natural groups or clusters in data sets. This book integrates the necessary elements of data analysis, cluster analysis, and computer implementation to cover the complete sequence of steps from raw data to the finished analysis. The author develops a conceptual and philosophical basis for using cluster analysis as a tool of discovery and applies it systematically throughout the book. He provides a comprehensive discussion of variables, scales, and measures of association that establishes a sound basis for constructing an operational definition of similarity tailored to the needs of any particular problem, and devotes special attention to the problems of analyzing data sets containing mixtures of nominal, ordinal, and interval variables.
Article
This paper proposes an effective approach for content-based sketch retrieval. It addresses three characteristics as follows. Firstly, both structural relations and global shape descriptors are combined to represent sketch content. Secondly, feature weighting and combination are performed to obtain a reasonable mechanism for similarity calculation. Finally, relevance feedback based on biased SVM (BSVM) algorithm is employed to capture user’s query interests online and thus improve retrieval performance. Experiments prove the effectiveness of our proposed method in sketch retrieval.
Article
Research has been devoted in recent years to relevance feedback as an effective solution for improving the performance of image similarity search. However, few methods using relevance feedback are currently able to handle relatively complex queries on large image databases. In the case of complex image queries, images with relevant concepts are often scattered across several visual regions in the feature space. This leads to adopting multiple regions to represent a query in the feature space, and it is therefore necessary to handle disjunctive queries. In this paper, we propose a new adaptive classification and cluster-merging method to find the multiple regions of a complex image query and their arbitrary shapes. Our method achieves the same high retrieval quality regardless of the shapes of the query regions, since the measures used in our method are invariant under linear transformations. Extensive experiments show that the result of our method converges quickly to the user's true information need, and that the retrieval quality of our method is about 22% better in recall and 20% better in precision than the query-expansion approach, and about 35% better in recall and 31% better in precision than the query-point-movement approach, in MARS.
Conference Paper
Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods. Additionally, we study a finer-grained scene representation to detect scenes embedded inside of larger scenes.
Conference Paper
Can we leverage the community-contributed collections of rich media on the web to automatically generate representative and diverse views of the world's landmarks? We use a combination of context- and content-based tools to generate representative sets of images for location-driven features and landmarks, a common search task. To do so, we use location and other metadata, as well as tags associated with images and the images' visual features. We present an approach to extracting tags that represent landmarks, and we show how to use unsupervised methods to extract representative views and images for each landmark. This approach can potentially scale to provide better search and representation for landmarks worldwide. We evaluate the system in the context of image search using a real-life dataset of 110,000 images from the San Francisco area.
Article
Recently, video search reranking has been an effective mechanism to improve the initial text-based ranking list by incorporating visual consistency among the result videos. While existing methods attempt to rerank all the individual result videos, they suffer from several drawbacks. In this article, we propose a new video reranking paradigm called cluster-based video reranking (CVR). The idea is to first construct a video near-duplicate graph representing the visual similarity relationship among videos, followed by identifying the near-duplicate clusters from the video near-duplicate graph, then ranking the obtained near-duplicate clusters based on cluster properties and intercluster links, and finally for each ranked cluster, a representative video is selected and returned. Compared to existing methods, the new CVR ranks clusters and exhibits several advantages, including superior reranking by utilizing more reliable cluster properties, fast reranking on a small number of clusters, diverse and representative results. Particularly, we formulate the near-duplicate cluster identification as a novel maximally cohesive subgraph mining problem. By leveraging the designed cluster scoring properties indicating the cluster's importance and quality, random walk is applied over the near-duplicate cluster graph to rank clusters. An extensive evaluation study proves the novelty and superiority of our proposals over existing methods.
Article
Ph.D. thesis, University of Texas at Austin, 1972. Corrections and an appendix have been added to the thesis (see foreword). Includes index and bibliography (pp. 496-508).
Conference Paper
We formulate the problem of scene summarization as selecting a set of images that efficiently represents the visual content of a given scene. The ideal summary presents the most interesting and important aspects of the scene with minimal redundancy. We propose a solution to this problem using multi-user image collections from the Internet. Our solution examines the distribution of images in the collection to select a set of canonical views to form the scene summary, using clustering techniques on visual features. The summaries we compute also lend themselves naturally to the browsing of image collections, and can be augmented by analyzing user-specified image tag data. We demonstrate the approach using a collection of images of the city of Rome, showing the ability to automatically decompose the images into separate scenes, and identify canonical views for each scene.
Conference Paper
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
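A compact sketch of the spatial-pyramid construction, assuming local features have already been quantized into visual words and their positions normalized to [0, 1); the level weights follow the pyramid match weighting described in the paper (finer levels weighted more heavily), while everything else is illustrative.

```python
import numpy as np

def spatial_pyramid(points, labels, n_words, levels=2):
    """Spatial-pyramid representation of one image (sketch).

    points  : (n, 2) array of feature locations, normalized to [0, 1)
    labels  : (n,) int array of visual-word indices, each < n_words
    Concatenates, for every level l, the per-cell word histograms over
    a 2^l x 2^l grid, weighted as in the pyramid match kernel
    (level 0: 1/2^L; level l >= 1: 1/2^(L - l + 1)).
    """
    labels = np.asarray(labels)
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        weight = (1.0 / 2 ** levels if level == 0
                  else 1.0 / 2 ** (levels - level + 1))
        # Grid cell index of each feature at this pyramid level.
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        for ix in range(cells):
            for iy in range(cells):
                mask = (cx == ix) & (cy == iy)
                hist = np.bincount(labels[mask], minlength=n_words)
                parts.append(weight * hist)
    return np.concatenate(parts)
```

Comparing two such vectors with histogram intersection approximates the pyramid match kernel between the underlying feature sets.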
Conference Paper
In this paper, we propose a robust technique for measuring image focus based on the discrete wavelet transform (DWT). We suggest the absolute sum of the second-level detail coefficients of the DWT as a focus measure. In our experiments, the absolute-sum-based measure performs as well as the energy-based measure while improving computational efficiency. A comparison with other benchmark measures is also discussed. Our measure exhibits more robustness to Gaussian noise. Moreover, we show that setting a threshold on the wavelet coefficients can dramatically increase the discriminative power of the focus measure in noisy conditions.
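A minimal sketch of this focus measure using PyWavelets, under the assumption of a Haar ('db1') basis; the threshold implements the noise-robust variant mentioned at the end of the abstract.

```python
import numpy as np
import pywt

def wavelet_focus_measure(image, threshold=0.0, wavelet='db1'):
    """Focus measure: absolute sum of the second-level detail
    sub-bands of a 2-D DWT, optionally thresholded for noise."""
    coeffs = pywt.wavedec2(image.astype(float), wavelet, level=2)
    # coeffs layout: [cA2, (cH2, cV2, cD2), (cH1, cV1, cD1)]
    details = np.concatenate([c.ravel() for c in coeffs[1]])
    details = details[np.abs(details) > threshold]  # suppress noise
    return float(np.abs(details).sum())
```

In an autofocus loop, the frame (or focus setting) maximizing this value would be taken as the best-focused one.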
Conference Paper
Relevance feedback schemes using linear/quadratic estimators have been applied in content-based image retrieval to significantly improve retrieval performance. One major difficulty in relevance feedback is estimating the support of the target images in a high-dimensional feature space from a relatively small number of training samples. We develop a novel scheme based on the one-class SVM, which fits a tight hypersphere in a nonlinearly transformed feature space to include most of the target images based on positive examples. The use of a kernel provides an elegant way to deal with nonlinearity in the distribution of the target images, while the regularization term in the SVM provides good generalization ability. To validate the efficacy of the proposed approach, we test it on both synthesized data and real-world images. Promising results are achieved in both cases.
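A brief sketch of this feedback scheme with scikit-learn's OneClassSVM: the positive (relevant) examples fit the hypersphere, and the signed distance to its boundary ranks the database. The feature vectors and hyperparameters below are placeholders.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder features: images the user marked relevant, and the
# database candidates to be re-ranked.
positives = np.random.rand(20, 64)
candidates = np.random.rand(1000, 64)

# An RBF kernel fits a tight hypersphere around the positives in the
# transformed space; nu bounds the fraction of positives left outside.
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.1).fit(positives)

# Rank database images by signed distance to the learned boundary:
# larger values are more likely to match the user's target concept.
scores = model.decision_function(candidates)
ranking = np.argsort(-scores)
```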
Conference Paper
Technologies in the areas of image processing (IP) and information retrieval (IR) evolved separately for a long time. However, successful content-based image retrieval systems require the integration of the two. There is an urgent need to develop integration mechanisms that link the image retrieval model to the text retrieval model, so that the well-established text retrieval techniques can be utilized. This paper proposes approaches for converting image feature vectors (the IF domain) to weighted-term vectors (the IR domain). Furthermore, the relevance feedback technique from the IR domain is used in content-based image retrieval to demonstrate the effectiveness of this conversion. Experimental results show that image retrieval precision increases considerably when using the proposed integration approach.
Conference Paper
This paper evaluates the performance both of several texture measures that have been successfully used in various applications and of some new promising approaches. For classification, a method based on the Kullback discrimination of sample and prototype distributions is used. Classification results are presented for single features with one-dimensional feature-value distributions and for pairs of complementary features with two-dimensional distributions.
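A minimal sketch of classification by Kullback discrimination between a sample feature-value histogram and per-class prototype histograms; the smoothing constant and the dictionary layout are illustrative assumptions.

```python
import numpy as np

def kullback_discrimination(sample, prototype, eps=1e-10):
    """Kullback discrimination (relative entropy) between a sample
    histogram and a class prototype histogram, both unnormalized."""
    p = sample / (sample.sum() + eps)
    q = prototype / (prototype.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def classify(sample_hist, prototype_hists):
    """Assign the texture class whose prototype distribution is
    closest to the sample distribution in the Kullback sense.

    prototype_hists : dict mapping class name -> prototype histogram
    """
    return min(prototype_hists,
               key=lambda c: kullback_discrimination(sample_hist,
                                                     prototype_hists[c]))
```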
Article
Content-based image retrieval (CBIR) has become one of the most active research areas in the past few years. Many visual feature representations have been explored and many systems built. While these research efforts establish the basis of CBIR, the usefulness of the proposed approaches is limited. Specifically, these efforts have relatively ignored two distinct characteristics of CBIR systems: (1) the gap between high-level concepts and low-level features, and (2) the subjectivity of human perception of visual content. This paper proposes a relevance feedback based interactive retrieval approach, which effectively takes into account the above two characteristics in CBIR. During the retrieval process, the user's high-level query and perception subjectivity are captured by dynamically updated weights based on the user's feedback. The experimental results over more than 70000 images show that the proposed approach greatly reduces the user's effort of composing a query, and captures the user's information need more precisely
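The dynamic re-weighting idea can be sketched as follows. The heuristic below emphasizes feature components on which the user's relevant examples agree and then re-ranks with a weighted distance; it is a common scheme of this family, not necessarily the exact update rule of the cited system.

```python
import numpy as np

def update_feature_weights(relevant_feats, eps=1e-6):
    """Re-weight feature components from positive feedback: components
    on which the relevant images agree (low standard deviation) are
    emphasized in the similarity computation.

    relevant_feats : (m, d) array, one row per relevant example
    """
    std = relevant_feats.std(axis=0) + eps
    weights = 1.0 / std
    return weights / weights.sum()

def weighted_distance(query, database, weights):
    """Weighted Euclidean distance used for the refined ranking."""
    return np.sqrt((((database - query) ** 2) * weights).sum(axis=1))
```

Each feedback round re-estimates the weights, so the metric progressively bends toward the user's subjective notion of similarity.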
Article
Learning-enhanced relevance feedback is one of the most promising and active research directions in recent years' content-based image retrieval. However, existing approaches either require prior knowledge of the data or converge slowly, and are thus not cost-effective. Motivated by the successful history of optimal adaptive filters, we present a new approach to interactive image retrieval based on an adaptive tree similarity model to address these difficulties. The proposed tree model is a hierarchical nonlinear Boolean representation of a user's query concept. Each path of the tree is a clustering pattern of the feedback examples, which is so small and local in the feature space that it can be nicely approximated by a linear model. Because of this linearity, the parameters of the similarity model are better learned by the optimal adaptive filter, which does not require any prior knowledge of the data and supports incremental learning with a fast convergence rate. The proposed approach is simple to implement and achieves better performance than most approaches. To illustrate its performance, extensive experiments have been carried out on a large heterogeneous image collection with 17,000 images, which render promising results on a wide variety of queries.
Article
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and the minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best-quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data-input-order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
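A short usage sketch with scikit-learn's Birch implementation; the threshold, branching factor, target number of clusters, and data are illustrative placeholders.

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10_000, 16)  # placeholder feature vectors

# threshold bounds the radius of each CF-tree leaf subcluster;
# branching_factor caps the children per CF-tree node. The tree is
# built incrementally, so the data is effectively scanned once.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=10)
labels = birch.fit_predict(X)
```

The memory-bounded CF tree is what allows BIRCH-style clustering to scale to datasets that do not fit the assumptions of in-memory methods.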