Article

Multimodal latent topic analysis for image collection summarization

Authors:
  • Universidad Nacional de Colombia, Bogotá, Colombia

Abstract

This paper presents a multimodal latent topic analysis method for the construction of image collection summaries. The method automatically selects a set of prototypical images from a large set of retrieved images for a given query. We define an image collection summary as a subset of images from a collection that is visually and semantically representative. To build such a summary we propose MICS (Multimodal Image Collection Summarization), a method that combines textual and visual modalities in a common latent space, which makes it possible to find a subset of images from which the whole collection can be reconstructed. Experiments were conducted on two collections of tagged images, demonstrating the ability of the approach to build summaries with representative visual and semantic contents. The method was evaluated using objective measures (reconstruction error and diversity of the summary), showing competitive results when compared to other summarization approaches.
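As an illustration of the reconstruction idea described in the abstract, the following minimal sketch greedily picks the images whose feature vectors best reconstruct the whole collection in the least-squares sense. It is not the actual MICS algorithm (which builds a joint textual-visual latent space via convex non-negative matrix factorization); the feature matrix and greedy strategy here are simplifying assumptions.

```python
import numpy as np

def greedy_summary(X, k):
    """Greedily pick k rows of X (one feature vector per image) whose
    span best reconstructs the whole collection under least squares.
    Illustrative reconstruction-error objective, not the MICS method."""
    n = X.shape[0]
    selected = []
    for _ in range(k):
        best_idx, best_err = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            S = X[selected + [i]]  # candidate prototype set, shape (len+1, d)
            # reconstruct every image as a linear combination of prototypes
            W, *_ = np.linalg.lstsq(S.T, X.T, rcond=None)
            err = np.linalg.norm(X - W.T @ S)  # collection reconstruction error
            if err < best_err:
                best_idx, best_err = i, err
        selected.append(best_idx)
    return selected

# toy features: 6 images, 4 dimensions (e.g. concatenated visual/text topics)
rng = np.random.default_rng(0)
X = rng.random((6, 4))
print(greedy_summary(X, 2))  # indices of 2 prototype images
```

The greedy loop is quadratic in collection size; matrix-factorization formulations such as the one in the paper avoid this exhaustive candidate scan.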


... Previous methods typically generate summaries based on low-level visual features or image tags [7], [8]. The output summary is a subset of the collection, in terms of images or image patches. ...
... Each group is then summarized separately by clustering its members based on visual features. Alternatively, word tags can be used together with visual features in a multimodal analysis that builds a latent space whose elements reconstruct the entire dataset [7]. The authors in [21] assign class labels to each image with a generic classifier, then map these labels to domain concepts using the WordNet and DBpedia ontologies. ...
... To assess SImS results, we rely on two frequently used evaluation metrics: coverage [21], [52] and diversity [7], [16]. Since ground-truth summaries are not available, other evaluation metrics such as ROUGE [53], V-ROUGE [16], and VERT [54] are not applicable. ...
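A minimal sketch of the kind of coverage and diversity measures mentioned above, assuming plain Euclidean feature vectors; the cited works define their own task-specific variants:

```python
import numpy as np

def diversity(summary_feats):
    """Mean pairwise Euclidean distance among summary items
    (higher = more diverse summary)."""
    S = np.asarray(summary_feats, dtype=float)
    n = len(S)
    d = [np.linalg.norm(S[i] - S[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(d))

def coverage(collection_feats, summary_feats, radius):
    """Fraction of collection items lying within `radius` of at least
    one summary item (higher = better coverage)."""
    C = np.asarray(collection_feats, dtype=float)
    S = np.asarray(summary_feats, dtype=float)
    covered = sum(any(np.linalg.norm(c - s) <= radius for s in S) for c in C)
    return covered / len(C)
```

The `radius` threshold is an assumption of this sketch; graph- and topic-based coverage definitions in the literature dispense with it.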
Article
Full-text available
Applications such as providing a preview of personal albums (e.g., Google Photos) or suggesting thematic collections based on user interests (e.g., Pinterest) require a semantically enriched image representation, which should be more informative than simple low-level visual features and image tags. To this aim, we propose an image collection summarization technique based on frequent subgraph mining. We represent images with a novel type of scene graph including fine-grained relationship types between objects. These scene graphs are automatically derived by our method. The resulting summary consists of a set of frequent subgraphs describing the underlying patterns of the image dataset. Our results are interpretable and provide more powerful semantic information than previous techniques, in which the summary is a subset of the collection in terms of images or image patches. The experimental evaluation shows that the proposed technique yields non-redundant summaries with a high diversity of discovered patterns.
... Conversely, we asked the collection's owners to select the most important photos according to their memories and perceptions, without any mention of coverage or diversity. Besides [365,372], other works in the literature rely on external knowledge to accomplish the task of image summarization [60,353,439]. Camargo et al. [60] combine textual and visual contents of a collection in the same latent semantic space, using Convex Non-Negative Matrix Factorization, to generate multimodal image collection summaries. Domain-specific ontologies are required as further input in [353]. ...
... Regarding their ages, 60.0% of the participants are between 20 and 30 years old, 25.7% between 30 and 40, 11.4% between 40 and 50, and 2.9% between 50 and 60. Previous works mostly consider either public photo collections, for instance available on social media like Facebook and Flickr [60,340,353,420], or pictures from a shared event in which all the evaluators took part [416]. One difficulty we see with using public collections of photos from different people, even if they attended the same event, is that individuals' differing experiences of the event may give them different levels of appreciation for the same photo, thus influencing their decisions. ...
Chapter
Full-text available
Thanks to the spread of digital photography and available devices, taking photographs has become effortless and tolerated nearly everywhere. As a result, people easily end up with hundreds or thousands of photos, for example when returning from a holiday trip or taking part in ceremonies, concerts, and other events. Furthermore, photos are also taken of more mundane subjects, such as food and aspects of everyday life, further increasing the number of photos to be dealt with. The decreasing price of storage devices makes dumping the whole set of photos common and affordable. However, this practice frequently turns the stored collections into a kind of dark archive, rarely accessed and enjoyed again in the future. The large size of the collections makes revisiting them time-consuming. This suggests identifying, with the support of automated methods, the sets of most important photos within the whole collections and investing some preservation effort in keeping them accessible over time. Evaluating the importance of photos to their owners is a complex process, often driven by personal attachment, memories behind the content, and personal tastes that are difficult to capture automatically. Therefore, to better understand the selection process for photo preservation and future revisiting, the first part of this chapter presents a user study on a photo selection task in which participants selected subsets of the most important pictures from their own collections. In the second part of this chapter, we present methods to automatically select important photos from personal collections, in light of the insights that emerged from the user study. We model a notion of photo importance driven by user expectations, representing which photos users perceive as important and would have selected. We present an expectation-oriented method for photo selection, where information at both the photo and collection levels is considered to predict the importance of photos.
... Due to the lack of ground truth for this task, we use common metrics from image collection scene-graph summarization tasks [2], [19]: similarity [16], [49], [50], coverage [28], [51], and diversity [52], [53] of a generated scene graph with respect to the ground-truth scene graph of each image. However, most evaluation techniques focus on estimating generation precision, in which the evaluation score tends to increase with the quantity of generated results. ...
... For multiple-image scene-graph summarization, we evaluate the proposed method for image-collection scene-graph summarization on the MS-COCO dataset. Due to the lack of ground truth, we follow the common practice in the evaluation of scene graph generation from three perspectives: "Coverage" [28], [51], "Diversity" [52], [53], and "Similarity" [49], [50]. For the Coverage evaluation, we follow graph theory to estimate the coverage of a generated scene graph with respect to ground-truth scene graphs. ...
Article
Full-text available
Summarization tasks aim to summarize multiple pieces of information into a short description or representative information. A text summarization task is a task that summarizes textual information into a short description, whereas in an image collection summarization task, also known as the photo album summarization task, the goal is to find the representative visual information of all images in the collection. In recent years, scene-graph generation has shown the advantage of describing the visual contexts of a single-image, and incorporating external knowledge into the scene-graph generation model has also given effective directions for unseen single-image scene-graph generation. Following this trend, in this paper, we propose a novel scene-graph-based image-collection summarization model. The key idea of the proposed method is to enhance the relation predictor toward relationships between images in an image collection incorporating knowledge graphs as external knowledge for training a model. To evaluate the proposed method, we build an extended annotated MS-COCO dataset for this task and introduce an evaluation process that focuses on estimating the similarity between a summarized scene graph and ground-truth scene graphs. Traditional evaluation focuses on calculating precision and recall scores, which involve true positive predictions without balancing precision and recall. Meanwhile, the proposed evaluation process focuses on calculating the F-score of the similarity between a summarized scene graph and ground-truth scene graphs which aims to balance both false positives and false negatives. Experimental results show that the use of external knowledge in enhancing the relation predictor achieves better results compared with existing methods.
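The F-score-based evaluation described above can be illustrated on sets of scene-graph triples. This is a generic set-overlap F-score balancing false positives and false negatives, not necessarily the exact similarity computation of the paper:

```python
def scene_graph_f1(predicted, ground_truth):
    """F-score between two collections of scene-graph triples
    (subject, relation, object), treating each as a set."""
    pred, gt = set(predicted), set(ground_truth)
    tp = len(pred & gt)          # triples present in both graphs
    if tp == 0:
        return 0.0
    precision = tp / len(pred)   # penalises false positives
    recall = tp / len(gt)        # penalises false negatives
    return 2 * precision * recall / (precision + recall)
```

For example, a summary graph sharing two of three triples with a three-triple ground truth scores 2/3, whereas precision or recall alone would hide one side of the error.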
... et al. [30] suggested a technique that first selected representative topics using a text mining method and then chose images corresponding to the chosen topics as representative images. In [31], the authors employed non-negative matrix factorization first in the textual domain and then used the resulting textual latent topics to perform non-negative matrix factorization in the visual domain for summarization. Chen and colleagues introduced a multi-modal Recurrent Neural Network (RNN) approach for extractive summarization of documents with images. ...
Article
Full-text available
The size of internet image collections is increasing drastically. As a result, new techniques are required to help users browse, navigate, and summarize these large-volume collections. Image collection summarization methods present users with a set of exemplar images as the most representative ones from the initial image collection. In this study, an image collection summarization technique based on the semantic hierarchies among images was introduced. In the proposed approach, a semantic hierarchical classifier mapped images to the nodes of a pre-defined domain ontology. We made a compromise between the degree of freedom of the classifier and the goodness of the summarization method. The summarization was done using a group of high-level features that provided a semantic measurement of the information in images. Experimental outcomes indicated that the introduced method outperformed recent techniques for the summarization of image collections.
... Many researchers have developed automatic or interactive summarization systems that use handcrafted (color, texture, and edge) or deep features to generate a satisfactory summary of an image collection [5], [6], [7]. Automatic image collection summarization typically uses a content analysis scheme, such as latent topic analysis [8], which may not always provide a desirable summary of photos that satisfies information integrity and visual aesthetics. Optimal visualization is seldom considered in current image collection summarization techniques. ...
Article
Full-text available
With the surge of images in the information era, people demand an effective and accurate way to access meaningful visual information. Accordingly, effective and accurate communication of information has become indispensable. In this work, we propose a content-based approach that automatically generates a clear and informative visual summarization of image collections based on design principles and cognitive psychology. We first introduce a novel method to make representative and non-redundant summarizations of image collections, thereby ensuring data cleanliness and emphasizing important information. Then, we propose a tree-based algorithm with a two-step optimization strategy to generate the final layout, which operates as follows: (1) an initial layout is created by constructing a tree randomly based on the grouping results of the input image set; (2) the layout is refined through a coarse adjustment in a greedy manner, followed by gradient back-propagation drawing on the training procedure of neural networks. We demonstrate the usefulness and effectiveness of our method via extensive experimental results and user studies. Our visual summarization algorithm can capture the main content of image collections more precisely and efficiently than alternative methods or commercial tools.
... [154] extract SIFT features and use a modification of RANSAC [39] plus Affinity Propagation clustering [43] for finding representative images. If there is accompanying textual or social information for the images, other approaches exist (e.g., see [18,55,121]). ...
Preprint
It is said that beauty is in the eye of the beholder. But how exactly can we characterize such discrepancies in interpretation? For example, are there any specific features of an image that makes person A regard an image as beautiful while person B finds the same image displeasing? Such questions ultimately aim at explaining our individual ways of interpretation, an intention that has been of fundamental importance to the social sciences from the beginning. More recently, advances in computer science brought up two related questions: First, can computational tools be adopted for analyzing ways of interpretation? Second, what if the "beholder" is a computer model, i.e., how can we explain a computer model's point of view? Numerous efforts have been made regarding both of these points, while many existing approaches focus on particular aspects and are still rather separate. With this paper, in order to connect these approaches we introduce a theoretical framework for analyzing interpretation, which is applicable to interpretation of both human beings and computer models. We give an overview of relevant computational approaches from various fields, and discuss the most common and promising application areas. The focus of this paper lies on interpretation of text and image data, while many of the presented approaches are applicable to other types of data as well.
... From another perspective, there is a strong tendency to study the artifact (medium, mode, material or symbolic form, among others) in isolation from its community, trying to find in its materiality the levels of evolution and distribution of information (Xiqin Liu, 2015). Likewise, there are studies that propose identifying how subjects behave in response to orchestrations of modes or resources in forms of representation (Camargo & González, 2016; R. M. Mateo, 2015). ...
Thesis
Full-text available
This doctoral research takes an epistemological approach to the concept of intercreativity from a phenomenographic perspective, modelled to enhance the design and creation of virtual learning environments. Its starting point is the recognition that, over the last 20 years, a way of teaching and learning has developed in virtual presence parallel to that practised in physical presence. Hence the interest in understanding the relationships between interactivity and creativity in virtual communities of practice, whose members, drawing on their different motivations, knowledge, and experiences in a field, design ideal ways to manage and transmit their learning (full text in Spanish).
... Modality fusion has received increasing attention in various recognition tasks. For example, speech recognition (Sun et al., 2016), video classification (Liu et al., 2016a), and image summarization (Camargo and González, 2016). There are mainly two types of fusion strategy for building multimodal models in state-of-the-art research, namely Feature-Level (FL) fusion (also known as "early fusion") which combines features from different modalities before performing recognition, and Decision-Level (DL) fusion (also known as "late fusion") which combines the predictions and their probabilities given by each unimodal model for the multimodal model to make the final decision (D'Mello and Kory, 2012). ...
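The two fusion strategies described in the excerpt above can be sketched as follows. The toy classifier and feature dimensions are placeholders, not the models from the cited works:

```python
import numpy as np

# Toy per-modality feature vectors for one input (illustrative shapes only).
acoustic = np.array([0.2, 0.7, 0.1])
lexical = np.array([0.9, 0.3])

# Feature-level ("early") fusion: concatenate features, classify once.
fused_features = np.concatenate([acoustic, lexical])

def toy_classifier(x, n_classes=2):
    """Stand-in for any trained unimodal model; returns class probabilities."""
    logits = np.array([x.sum(), -x.sum()])[:n_classes]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Decision-level ("late") fusion: classify each modality separately,
# then combine the per-class probabilities (here: simple averaging).
p_acoustic = toy_classifier(acoustic)
p_lexical = toy_classifier(lexical)
p_late = (p_acoustic + p_lexical) / 2
print(fused_features.shape, p_late.round(3))
```

Early fusion lets one model learn cross-modal interactions; late fusion keeps the unimodal models independent and combines only their decisions, which is often more robust when one modality is missing or noisy.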
Thesis
Full-text available
Automatic emotion recognition has long been a focus of Affective Computing. It has become increasingly apparent that awareness of human emotions in Human-Computer Interaction (HCI) is crucial for advancing related technologies, such as dialogue systems. However, performance of current automatic emotion recognition is disappointing compared to human performance. Current research on emotion recognition in spoken dialogue focuses on identifying better feature representations and recognition models from a data-driven point of view. The goal of this thesis is to explore how incorporating prior knowledge of human emotion recognition in the automatic model can improve state-of-the-art performance of automatic emotion recognition in spoken dialogue. Specifically, we study this by proposing knowledge-inspired features representing occurrences of disfluency and non-verbal vocalisation in speech, and by building a multimodal recognition model that combines acoustic and lexical features in a knowledge-inspired hierarchical structure. In our study, emotions are represented with the Arousal, Expectancy, Power, and Valence emotion dimensions. We build unimodal and multimodal emotion recognition models to study the proposed features and modelling approach, and perform emotion recognition on both spontaneous and acted dialogue. Psycholinguistic studies have suggested that DISfluency and Non-verbal Vocalisation (DIS-NV) in dialogue is related to emotions. However, these affective cues in spoken dialogue are overlooked by current automatic emotion recognition research. Thus, we propose features for recognizing emotions in spoken dialogue which describe five types of DIS-NV in utterances, namely filled pause, filler, stutter, laughter, and audible breath. Our experiments show that this small set of features is predictive of emotions. 
Our DIS-NV features achieve better performance than benchmark acoustic and lexical features for recognizing all emotion dimensions in spontaneous dialogue. Consistent with Psycholinguistic studies, the DIS-NV features are especially predictive of the Expectancy dimension of emotion, which relates to speaker uncertainty. Our study illustrates the relationship between DIS-NVs and emotions in dialogue, which contributes to Psycholinguistic understanding of them as well. Note that our DIS-NV features are based on manual annotations, yet our long-term goal is to apply our emotion recognition model to HCI systems. Thus, we conduct preliminary experiments on automatic detection of DIS-NVs, and on using automatically detected DIS-NV features for emotion recognition. Our results show that DIS-NVs can be automatically detected from speech with stable accuracy, and auto-detected DIS-NV features remain predictive of emotions in spontaneous dialogue. This suggests that our emotion recognition model can be applied to a fully automatic system in the future, and holds the potential to improve the quality of emotional interaction in current HCI systems. To study the robustness of the DIS-NV features, we conduct cross-corpora experiments on both spontaneous and acted dialogue. We identify how dialogue type influences the performance of DIS-NV features and emotion recognition models. DIS-NVs contain additional information beyond acoustic characteristics or lexical contents. Thus, we study the gain of modality fusion for emotion recognition with the DIS-NV features. Previous work combines different feature sets by fusing modalities at the same level using two types of fusion strategies: Feature-Level (FL) fusion, which concatenates feature sets before recognition; and Decision-Level (DL) fusion, which makes the final decision based on outputs of all unimodal models. However, features from different modalities may describe data at different time scales or levels of abstraction. 
Moreover, Cognitive Science research indicates that when perceiving emotions, humans make use of information from different modalities at different cognitive levels and time steps. Therefore, we propose a HierarchicaL (HL) fusion strategy for multimodal emotion recognition, which incorporates features that describe data at a longer time interval or which are more abstract at higher levels of its knowledge-inspired hierarchy. Compared to FL and DL fusion, HL fusion incorporates both inter- and intra-modality differences. Our experiments show that HL fusion consistently outperforms FL and DL fusion on multimodal emotion recognition in both spontaneous and acted dialogue. The HL model combining our DIS-NV features with benchmark acoustic and lexical features improves current performance of multimodal emotion recognition in spoken dialogue. To study how other emotion-related tasks of spoken dialogue can benefit from the proposed approaches, we apply the DIS-NV features and the HL fusion strategy to recognize movie-induced emotions. Our experiments show that although designed for recognizing emotions in spoken dialogue, DIS-NV features and HL fusion remain effective for recognizing movie-induced emotions. This suggests that other emotion-related tasks can also benefit from the proposed features and model structure.
... A monomodal collection contains items from a single mode, such as the Wikipedia dataset used in Hong and Si (2012). By contrast, a multimodal collection contains objects from different modes, as in the work of Yilmaz, Gulen, Yazici, and Kitsuregawa (2012), which manages video and text, or the work of Camargo and González (2016), which employs two datasets of images, Flickr4Concepts and MIRFlickr (Huiskes & Lew, 2008). ...
Article
Interactive multimodal information retrieval (IMIR) systems increase the capabilities of traditional search systems by adding the ability to retrieve information of different types (modes) and from different sources. This article describes a formal model for interactive multimodal information retrieval, including formal and widespread definitions of each component of an IMIR system. A use case focused on information retrieval about sports validates the model, through a prototype that implements a subset of the model's features. Adaptive techniques applied to the retrieval functionality of IMIR systems have been defined by analysing past interactions using decision trees, neural networks, and clustering techniques. The model includes a strategy for selecting sources and combining the results obtained from every source. After modifying the prototype's source-selection strategy, the system is re-evaluated using classification techniques. This evaluation compares the normalised discounted cumulative gain (NDCG) obtained using two approaches: the multimodal system with a baseline source-selection strategy based on predefined rules, and the same system with its functionality adapted by past user interactions. In the adapted system, a final NDCG value of 81.54% was obtained.
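For reference, NDCG as used in the evaluation above can be computed as in this minimal sketch, with graded relevance and the standard log2 position discount; the cited system's exact gain and cut-off choices are not specified here:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance at rank i (1-indexed)
    discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the system ranking normalised by the ideal (sorted) ranking,
    so a perfect ordering scores 1.0."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

For instance, a ranking that places a mildly relevant document above a highly relevant one scores below 1.0, while any ranking already in descending relevance order scores exactly 1.0.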
... In their approach, the most representative topics are selected by text mining approaches and then images related to the selected topics are chosen as representative images. Camargo and colleagues [6] apply non-negative matrix factorization to find latent topics in the textual space and use the resulting textual latent topics to drive non-negative matrix factorization in the visual space, producing the summarization. ...
Article
Full-text available
With the advent of digital cameras, the number of digital images is on the increase. As a result, image collection summarization systems have been proposed to provide users with a condensed set of summary images representative of the original high-volume image set. In this paper, a semantic knowledge-based approach for image collection summarization is presented. Although ontology- and knowledge-based systems have been applied in other areas of image retrieval and image annotation, most current image summarization systems make use of visual or numeric metrics for conducting the summarization. Some image summarization systems jointly model the visual data of images together with their accompanying textual or social information, but these side data are not available outside the context of web or social images. The main motivation for using an ontology approach in this study is its ability to improve the results of computer vision tasks through the additional knowledge it provides to the system. We defined a set of ontology-based features to measure the amount of semantic information contained in each image, built a semantic similarity graph from the resulting semantic similarities, and selected summary images based on graph centrality on the similarity graph. Experimental results showed that the proposed approach worked well and outperformed current image summarization systems.
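A minimal sketch of centrality-based selection on a similarity graph, in the spirit of the approach above; degree centrality on a precomputed similarity matrix stands in for whatever centrality measure and ontology-based similarities the paper actually uses:

```python
import numpy as np

def centrality_summary(sim, k):
    """Pick the k items with highest degree centrality in a semantic
    similarity graph, given a symmetric n x n similarity matrix `sim`.
    Central items are similar to many others, hence good summary candidates."""
    S = np.asarray(sim, dtype=float).copy()
    np.fill_diagonal(S, 0.0)        # ignore self-similarity
    scores = S.sum(axis=1)          # weighted degree centrality per item
    return [int(i) for i in np.argsort(-scores)[:k]]
```

For example, with three images where image 0 is strongly similar to both others, `centrality_summary(sim, 1)` returns `[0]`.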
Article
This work proposes an efficient Summary Caption Technique (SCT) that takes a multimodal summary and image captions as input and retrieves, via the captions, the images most relevant to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task in computer vision (CV) and natural language processing (NLP), and merging these fields is tedious, though the research community has steadily focused on cross-modal retrieval. Related problems include visual question answering, matching queries with images, and matching semantic relationships between two modalities to retrieve the corresponding image. Prior work matches questions with visual information, uses object detection to match text with visual content, and employs structural-level representations to align images with text. However, these techniques primarily retrieve images for text or for image captioning; less effort has been spent on retrieving relevant images for a multimodal summary. Hence, our proposed technique extracts and merges features in a Hybrid Image Text (HIT) layer and embeds captions with word2vec, so that contextual features and semantic relationships are compared and matched between the modalities using cosine semantic similarity. In cross-modal retrieval, we obtain the top five related images and align with the multimodal summary the image that achieves the highest cosine score among them. The model was trained as a seq-to-seq model for 100 epochs, reducing information loss with sparse categorical cross-entropy. Further experiments on the multimodal summarization with multimodal output (MSMO) dataset evaluate the quality of image alignment with an image-precision metric, demonstrating the best results.
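The cosine-similarity ranking step described above can be sketched as follows, assuming precomputed embedding vectors; the HIT-layer features and word2vec embeddings of the actual system are replaced here by plain NumPy arrays:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_images(summary_vec, image_vecs, k=5):
    """Rank candidate image embeddings by cosine similarity to the
    summary embedding; return the indices of the k best matches."""
    scores = [cosine_sim(summary_vec, v) for v in image_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

In a pipeline like the one above, the first index returned (the highest cosine score among the top k) would be the image aligned with the multimodal summary.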
Chapter
Developing technologies and new social media platforms (Facebook, YouTube, Flickr, and Twitter) are transforming how people communicate with each other and scrutinize information. Different social media users prefer different modes of information for expressing their views. For instance, Twitter users prefer short text and informally taken photos; YouTube viewers tend to discuss breaking news in the form of short, roughly edited videos; and professional journalists usually produce standardized news stories with substantial text and well-chosen images. Every day, a large number of videos, images, and texts are shared on social media platforms. These cross-media data provide credible insights into public opinion, which are extremely beneficial to commercial firms, governments, and any organization concerned with societal opinions. As a result, assessing such voluminous data is beneficial. When cross-media information is combined, it can provide a more comprehensive and effective description of a given event or issue, as well as improve system performance. In addition to being a significant source of information, cross-media information provided by various media is more robust, reaches a larger audience, and appears to be a more accurate reflection of real-world events. Consequently, robustly identifying topics in multimodal data from multiple media is a useful and sensible extension of cross-media topic identification. This paper presents a review of various cross-media topic detection approaches. Keywords: Cross-media topic detection, Topic detection approaches, Challenges, Applications
Article
Image deletion and image insertion, the two kinds of image set management currently in common use, refer to removing compressed images from, and adding new photos to, a compressed image set, respectively. However, they do not deal well with the problem of combining images selected from multiple compressed image sets. To address this issue, in this paper we propose an image subset union algorithm for compressed image sets. Image subset union aims to integrate image subsets derived from multiple compressed image sets into a new compressed image set. First, we put forward a way to identify the root vertex candidate of the new compressed image set. Second, we classify all the images of the image subsets into three categories: images that need to be re-encoded, images that need only to be decoded, and images that need neither re-encoding nor decoding. Third, we employ minimum spanning tree production, a proposed vertex layer candidate assignment method, depth- and subtree-constrained minimum spanning tree generation, and image re-encoding to build the new compressed image set. Experimental results show that our proposed algorithm can effectively accomplish image subset union with low complexity.
Article
Image deletion refers to removing images from a compressed image set on cloud servers, and has always received much attention. However, in some cases images are not successfully deleted, and there remains room to improve coding performance. In this paper, we propose a low-complexity, high-coding-efficiency image deletion algorithm. First, all the images are classified into to-be-deleted images, images that need no processing, and images that need processing, the last further divided into images that need only decoding and images that need re-encoding. Then, we propose a depth- and subtree-constrained minimum spanning tree (DSCMST) heuristic to produce the DSCMST of the images that need processing. Third, every image that needs no processing is added to the obtained DSCMST as the child of the vertex that is still its parent in the compressed image set. Finally, after the images that need re-encoding are encoded, a new compressed image set is constructed, completing the image deletion. Experimental results show that under various circumstances our proposed algorithm can effectively remove any images, including the root vertex, internal vertices, and leaf vertices. Moreover, compared with state-of-the-art methods, the proposed algorithm achieves higher coding efficiency with the minimum complexity.
Article
Recent increases in the number of digital photos on content sharing and social networking websites have created an endless demand for techniques to analyze, navigate, and summarize these images. In this paper, we focus on image collection summarization. Earlier methods in image collection summarization consider representativeness and diversity criteria, while recent ones also consider criteria such as image quality, aesthetics, or appeal. In this paper, we propose a multi-criteria context-sensitive approach for social image collection summarization. In the proposed method, two different sets of features are combined, each addressing different criteria for image collection summarization: social attractiveness features and semantic features. The first feature set considers the different aspects that make an image appealing, such as image quality, aesthetics, and emotion, to create an attractiveness score for input images, while the second covers the semantic content of images and assigns a semantic score to them. We use social network infrastructure to identify attractiveness features and a domain ontology for extracting ontology features. The final summarization is produced by integrating the attractiveness and semantic features of the input images. Experimental results on a collection of human-generated summaries for a set of Flickr images demonstrate the effectiveness of the proposed image collection summarization approach.
Article
Image collection summarisation aims to represent a large-scale multi-modal collection with a small subset of images and tags, helping users navigate a large image dataset. Most extant methods leverage the contribution of text to the visual summary, ignoring the visual contribution to the textual topic. When the tags are weakly labelled, the textual topic cannot accurately reflect the visual summary. To solve this, the authors propose a novel model, joint optimisation of convex non-negative matrix factorisation, which incorporates images and tags in a mutually beneficial way. The objective function contains visual and textual error terms sharing the same indicator matrix, connecting the two modal relations. The authors then propose an iterative algorithm to optimise the proposed model. Finally, they explore the effects of different visual feature representations (e.g. bag-of-words and deep learning) on the multi-modal collection summary. The proposed method is compared with state-of-the-art algorithms on two multi-modal datasets (MIRFlickr and NUS-WIDE-SCENE). Experimental results demonstrate the effectiveness of the proposed approach.
Article
The multimodal optimization problems (MMOPs) need to find multiple optima simultaneously, so the population diversity is a critical issue that should be considered in designing an evolutionary optimization algorithm for MMOPs. Taking advantage of evolutionary multiobjective optimization in maintaining good population diversity, this paper proposes a tri-objective differential evolution (DE) approach to solve MMOPs. Given an MMOP, we first transform it into a tri-objective optimization problem (TOP). The three optimization objectives are constructed based on 1) the objective function of an MMOP, 2) the individual distance information measured by a set of reference points, and 3) the shared fitness based on niching technique. The first two objectives are mutually conflicting so that the advantage of evolutionary multiobjective optimization can be fully used. The population diversity is greatly improved by the third objective constructed by the niching technique which is insensitive to niching parameters. Mathematical proofs are given to demonstrate that the Pareto-optimal front of the TOP contains all global optima of the MMOP. Subsequently, DE-based multiobjective optimization techniques are applied to solve the converted TOP. Moreover, a modified solution comparison criterion and an adaptive ranking strategy for DE are introduced to improve the accuracy of solutions. Experiments have been conducted on 44 benchmark functions to evaluate the performance of the proposed approach. The results show that the proposed approach achieves competitive performance compared with several state-of-the-art multimodal optimization algorithms.
Article
Full-text available
This paper proposes an interactive framework that allows a user to rapidly explore and visualize a large image collection using the medium of average images. Average images have been gaining popularity as means of artistic expression and data visualization, but the creation of compelling examples is a surprisingly laborious and manual process. Our interactive, real-time system provides a way to summarize large amounts of visual data by weighted average(s) of an image collection, with the weights reflecting user-indicated importance. The aim is to capture not just the mean of the distribution, but a set of modes discovered via interactive exploration. We pose this exploration in terms of a user interactively "editing" the average image using various types of strokes, brushes and warps, similar to a normal image editor, with each user interaction providing a new constraint to update the average. New weighted averages can be spawned and edited either individually or jointly. Together, these tools allow the user to simultaneously perform two fundamental operations on visual data: user-guided clustering and user-guided alignment, within the same framework. We show that our system is useful for various computer vision and graphics applications.
Article
Full-text available
Latest results indicate that features learned via convolutional neural networks outperform previous descriptors on classification tasks by a large margin. It has been shown that these networks still work well when they are applied to datasets or recognition tasks different from those they were trained on. However, descriptors like SIFT are not only used in recognition but also for many correspondence problems that rely on descriptor matching. In this paper we compare features from various layers of convolutional neural nets to standard SIFT descriptors. We consider a network that was trained on ImageNet and another one that was trained without supervision. Surprisingly, convolutional neural networks clearly outperform SIFT on descriptor matching.
Conference Paper
Full-text available
Effective management of multimedia data is becoming vital for success in the modern era of omnipresent data. Summarization tools, which allow users to quickly get the gist of a given data collection and have proven their usefulness in text domain, are now gaining popularity also in multimedia processing. However, existing algorithms provide visual-only summaries for image collections, which are difficult to index and search. This paper introduces a prototype software tool that automatically creates multi-modal summaries of personal image collections by enriching the visual collage with keyword annotation. The result is presented as a web page that allows users to browse and share the summarized data.
Conference Paper
Full-text available
In this paper we address the issue of optimizing current social photo retrieval technology in terms of users' requirements. Typical users want images that are both accurately relevant to the query and non-redundant, so they can build a correct, exhaustive perception of the query. We propose to tackle this issue by combining two approaches previously considered non-overlapping: machine image analysis for pre-filtering the initial query results, followed by crowd-sourcing for a final refinement. In this mechanism, the machine part plays the role of reducing time and resource consumption, allowing better crowd-sourcing results. The machine technique ensures representativeness by re-ranking all images according to the most common image in the initial noisy set; additionally, diversity is ensured by clustering the images and selecting the best-ranked images among the most representative in each cluster. Further, the crowd-sourcing part enforces both representativeness and diversity in the images, objectives that are, to a certain extent, out of reach for the automated machine technique alone. The mechanism was validated on more than 25,000 photos retrieved from several common social media platforms, proving the efficiency of this approach.
Article
Full-text available
In recent years there has been growing interest in the study of sparse representation of signals. Using an overcomplete dictionary that contains prototype signal-atoms, signals are described by sparse linear combinations of these atoms. Recent activity in this field has concentrated mainly on the study of pursuit algorithms that decompose signals with respect to a given dictionary. In this paper we propose a novel algorithm, the K-SVD algorithm, generalizing the K-Means clustering process, for adapting dictionaries in order to achieve sparse signal representations. We analyze this algorithm and demonstrate its results both on synthetic tests and in applications on real data.
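The dictionary-update stage of K-SVD can be sketched in a few lines: each atom, together with the coefficients of the signals that use it, is refit by a rank-1 SVD of the residual left by all other atoms. Below is a minimal numpy illustration of that single step on toy data (function and variable names are ours); it omits the sparse-coding (pursuit) stage of the full algorithm:

```python
import numpy as np

def ksvd_atom_update(X, D, A, k):
    """One K-SVD dictionary-update step for atom k.

    X : (n_features, n_signals) data matrix
    D : (n_features, n_atoms)   dictionary (columns are atoms)
    A : (n_atoms, n_signals)    sparse coefficient matrix
    """
    users = np.nonzero(A[k])[0]          # signals that actually use atom k
    if users.size == 0:
        return D, A
    # Residual of those signals with atom k's contribution removed.
    E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
    # Best rank-1 fit of the residual gives the new atom and coefficients.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                    # unit-norm by construction
    A[k, users] = s[0] * Vt[0]
    return D, A

# Toy demo: data generated exactly from a known 2-atom dictionary,
# then atom 0 is corrupted and repaired by one update step.
rng = np.random.default_rng(0)
D_true = rng.standard_normal((8, 2))
D_true /= np.linalg.norm(D_true, axis=0)
A = np.zeros((2, 20))
A[0, :10] = rng.standard_normal(10)      # each signal uses exactly one atom
A[1, 10:] = rng.standard_normal(10)
X = D_true @ A
D = D_true.copy()
D[:, 0] += 0.3 * rng.standard_normal(8)  # corrupt atom 0
err_before = np.linalg.norm(X - D @ A)
D, A = ksvd_atom_update(X, D, A, 0)
err_after = np.linalg.norm(X - D @ A)
```

Since each toy signal uses a single atom, the residual is exactly rank 1 and the update recovers the corrupted atom exactly; on real data the improvement is only approximate.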
Article
Full-text available
To address the inability of current ranking systems to support subtopic retrieval, two main post-processing techniques of search results have been investigated: clustering and diversification. In this paper we present a comparative study of their performance, using a set of complementary evaluation measures that can be applied to both partitions and ranked lists, and two specialized test collections focusing on broad and ambiguous queries, respectively. The main finding of our experiments is that diversification of top hits is more useful for quick coverage of distinct subtopics whereas clustering is better for full retrieval of single subtopics, with a better balance in performance achieved through generating multiple subsets of diverse search results. We also found that there is little scope for improvement over the search engine baseline unless we are interested in strict full-subtopic retrieval, and that search results clustering methods do not perform well on queries with low divergence subtopics, mainly due to the difficulty of generating discriminative cluster labels.
Article
Full-text available
We present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. We propose and compare two alternative implementations using different classifiers: Naïve Bayes and SVM. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. We present results for simultaneously classifying seven semantic visual categories. These results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information.
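The vector-quantization step behind the bag-of-keypoints representation is easy to state concretely: assign each local descriptor to its nearest visual word and count. A minimal numpy sketch (the codebook and the affine-invariant descriptors are assumed given; all names and the toy data are illustrative):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors to their nearest codebook word and
    return a normalized word-count histogram (the 'bag of keypoints')."""
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
codebook = rng.standard_normal((5, 16))  # 5 visual words, 16-dim descriptors
# Four patches: three noisy copies of word 0, one noisy copy of word 2.
desc = codebook[[0, 0, 0, 2]] + 0.01 * rng.standard_normal((4, 16))
h = bow_histogram(desc, codebook)
```

The resulting fixed-length histogram is what the Naïve Bayes or SVM classifier consumes, regardless of how many keypoints the image produced.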
Conference Paper
Full-text available
This paper compares the efficacy and efficiency of different clustering approaches for selecting a set of exemplar images to present in the context of a semantic concept. We evaluate these approaches using 900 diverse queries, each associated with 1000 web images, comparing the exemplars chosen by clustering to the top 20 images for that search term. Our results suggest that Affinity Propagation is effective in selecting exemplars that match the top search images, but at high computational cost. We improve on these early results using a simple distribution-based selection filter on incomplete clustering results. This improvement allows us to use more computationally efficient clustering approaches, such as Hierarchical Agglomerative Clustering (HAC) and Partitioning Around Medoids (PAM), while still reaching the same (or better) quality of results as were given by Affinity Propagation in the original study. The computational savings are significant, since these alternatives are 7-27 times faster than Affinity Propagation.
Article
Full-text available
The goal of this paper is to describe an exploration system for large image databases that helps the user understand the database as a whole, discover hidden relationships, and formulate insights with minimum effort. While the proposed system works with any type of low-level feature representation of images, we describe our system using color information. The system is built in three stages. The first is the feature extraction stage, in which images are represented in a way that allows efficient storage and retrieval results closer to human perception. The second stage builds a hierarchy of clusters, in which the cluster prototype, as the electronic identification (eID) of that cluster, is designed to summarize the cluster in a manner suited to quick human comprehension of its components. Formally, an electronic identification (eID) is the image most similar to the other images in the corresponding cluster; that is, the image that maximizes the sum of the squares of the similarity values to the other images of that cluster. Besides summarizing the image database to a certain level of detail, an eID image provides access either to another set of eID images on a lower level of the hierarchy or to a group of images perceptually similar to itself. In the third stage, the multi-dimensional scaling technique is used to provide a tool for visualizing the database at different levels of detail. Moreover, it enables semi-automatic annotation, in the sense that the image database is organized so that perceptually similar images are grouped together to form perceptual contexts. As a result, instead of trying to give all possible meanings to an image, the user interprets and annotates an image in the context in which that image appears, thus dramatically reducing the time taken to annotate large collections of images.
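The eID definition above is fully operational: pick the cluster member whose summed squared similarity to all other members is largest. A minimal numpy sketch with a made-up similarity matrix (illustrative only, not the paper's system):

```python
import numpy as np

def eid_index(S):
    """Return the index of the cluster's eID: the member maximizing the
    sum of squared similarities to the other members. S is a symmetric
    similarity matrix over the members of one cluster."""
    S2 = S ** 2
    np.fill_diagonal(S2, 0.0)  # exclude self-similarity from the sum
    return int(S2.sum(axis=1).argmax())

# Toy cluster: item 1 is similar to everyone, item 3 is an outlier.
S = np.array([[1.0, 0.9, 0.7, 0.1],
              [0.9, 1.0, 0.8, 0.2],
              [0.7, 0.8, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])
```

Here `eid_index(S)` picks item 1, the most central member, as the cluster's summary image.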
Conference Paper
Full-text available
We consider the problem of clustering Web image search results. Generally, the image search results returned by an image search engine contain multiple topics. Organizing the results into different semantic clusters facilitates users' browsing. In this paper, we propose a hierarchical clustering method using visual, textual and link analysis. By using a vision-based page segmentation algorithm, a web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image. By using block-level link analysis techniques, an image graph can be constructed. We then apply spectral techniques to find a Euclidean embedding of the images which respects the graph structure. Thus for each image, we have three kinds of representations, i.e. visual feature based representation, textual feature based representation and graph based representation. Using spectral clustering techniques, we can cluster the search results into different semantic clusters. An image search example illustrates the potential of these techniques.
Conference Paper
Full-text available
In the area of image retrieval, post-retrieval processing is often used to refine retrieval results to better satisfy users' requirements. Previous methods mainly focus on presenting users with relevant results. However, in most cases, users cannot clearly express their requirements in a few query words. Therefore, relevant results with rich topic coverage are more likely to meet users' ambiguous needs. In this paper, a re-ranking method based on topic richness analysis is proposed to enrich topic coverage in retrieval results. Furthermore, a quantitative criterion called the diversity score (DS) is proposed to evaluate the improvement. Given a set of images, topics that are rarely included in the set are scarce topics, as opposed to rich topics that are widely distributed among the set. Scarce topics contribute more than rich topics to the DS of images. Five researchers were invited to evaluate the re-ranked results for both topic coverage and relevance. Experimental results on over 20,000 images demonstrate that the proposed approach is effective in improving the topic coverage of retrieval results without loss of relevance.
Conference Paper
Full-text available
Image clustering, an important technology for image processing, has been actively researched for a long time. Especially in recent years, with the explosive growth of the Web, image clustering has become a critical technology for helping users digest the large amount of online visual information. However, as far as we know, many previous works on image clustering used either low-level visual features or surrounding texts alone, and rarely exploited both kinds of information in the same framework. To tackle this problem, we propose a novel method named consistent bipartite graph co-partitioning, which can cluster Web images based on the consistent fusion of the information contained in both low-level features and surrounding texts. In particular, we formulate it as a constrained multi-objective optimization problem, which can be efficiently solved by semi-definite programming (SDP). Experiments on a real-world Web image collection show that the proposed method outperforms methods based only on low-level features or surrounding texts.
Conference Paper
Full-text available
The vast majority of the features used in today's commercially deployed image search systems employ techniques that are largely indistinguishable from text-document search: the images returned in response to a query are based on the text of the web pages from which they are linked. Unfortunately, depending on the query type, the quality of this approach can be inconsistent. Several recent studies have demonstrated the effectiveness of using image features to refine search results. However, it is not clear whether (or how much) the image-based approach can generalize to larger samples of web queries. Also, the previously used global features often only capture a small part of the image information, which in many cases does not correspond to the distinctive characteristics of the category. This paper explores the use of local features in the concrete task of finding the single canonical images for a collection of commonly searched-for products. Through large-scale user testing, the canonical images found by using only local image features significantly outperformed the top results from Yahoo, Microsoft and Google, highlighting the importance of having these image features as an integral part of future image search engines.
Conference Paper
Full-text available
Large-scale image retrieval on the Web relies on the availability of short snippets of text associated with the image. This user-generated content is a primary source of information about the content and context of an image. While traditional information retrieval models focus on finding the most relevant document without consideration for diversity, image search requires results that are both diverse and relevant. This is problematic for images because they are represented very sparsely by text, and as with all user-generated content the text for a given image can be extremely noisy. The contribution of this paper is twofold. First, we present a retrieval model which provides diverse results as a property of the model itself, rather than in a post-retrieval step. Relevance models offer a unified framework to afford the greatest diversity without harming precision. Second, we show that it is possible to minimize the trade-offs between precision and diversity, and estimating the query model from the distribution of tags favors the dominant sense of a query. Relevance models operating only on tags offers the highest level of diversity with no significant decrease in precision.
Conference Paper
Full-text available
The SIFT keypoint descriptor is a powerful approach to encoding local image description using edge orientation histograms. Through codebook construction via k-means clustering and quantisation of SIFT features, we can achieve image retrieval treating images as bags-of-words. Intensity inversion of an image results in distinct SIFT features for a single local image patch across the two images. Intensity inversions notwithstanding, these two patches are structurally identical. Through careful reordering of the SIFT feature vectors, we can construct the SIFT feature that would have been generated from a non-inverted image patch, starting with those extracted from an inverted image patch. Furthermore, through examination of the local feature detection stage, we can estimate whether a given SIFT feature belongs in the space of inverted features or non-inverted features. Therefore we can consistently separate the space of SIFT features into two distinct subspaces. With this knowledge, we can demonstrate a reduction in the time complexity of codebook construction via clustering by up to a factor of four, and also reduce the memory consumption of the clustering algorithms, while producing equivalent retrieval results.
Conference Paper
Full-text available
In this paper, we propose a video summarization algorithm based on multiple extractions of key frames in each shot. The algorithm uses k-medoid clustering to find the best representative frame for each video shot. The algorithm, which is applicable to all types of descriptors, extracts key frames by similarity clustering according to the given index. In our proposal, the distance between frames is calculated using a fast full-search block matching algorithm based on the frequency domain. The proposed approach is computationally tractable and robust with respect to sudden changes in mean intensity within a shot. Additionally, this approach produces distinct key frames even in the presence of large motion. The experimental results show that our algorithm extracts multiple representative frames in each video shot without visual redundancy, making it an effective tool for video indexing and retrieval.
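The k-medoid clustering at the core of this key-frame selection can be sketched as a plain PAM-style alternation on a precomputed frame-distance matrix. This is a generic numpy illustration on toy "frames" (the paper's actual distance comes from block matching in the frequency domain; all names here are ours):

```python
import numpy as np

def k_medoids(D, k, iters=50, seed=0):
    """Plain k-medoids on a precomputed distance matrix D: alternate
    assigning items to the nearest medoid and re-picking each cluster's
    medoid as its minimum-total-distance member."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)
        new = np.array([
            np.where(labels == c)[0][
                D[np.ix_(labels == c, labels == c)].sum(axis=0).argmin()]
            for c in range(k)])
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels

# Toy shot: two well-separated groups of frame descriptors.
rng = np.random.default_rng(2)
frames = np.vstack([rng.normal(0, 0.1, (10, 4)),
                    rng.normal(5, 0.1, (10, 4))])
D = np.linalg.norm(frames[:, None] - frames[None, :], axis=-1)
medoids, labels = k_medoids(D, 2)
```

Each medoid is an actual frame, so the cluster centers double as the shot's key frames.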
Conference Paper
Full-text available
In this work we present topic diversification, a novel method designed to balance and diversify personalized recommendation lists in order to reflect the user's complete spectrum of interests. Though being detrimental to average accuracy, we show that our method improves user satisfaction with recommendation lists, in particular for lists generated using the common item-based collaborative filtering algorithm. Our work builds upon prior research on recommender systems, looking at properties of recommendation lists as entities in their own right rather than specifically focusing on the accuracy of individual recommendations. We introduce the intra-list similarity metric to assess the topical diversity of recommendation lists and the topic diversification approach for decreasing the intra-list similarity. We evaluate our method using book recommendation data, including offline analysis on 361,349 ratings and an online study involving more than 2,100 subjects.
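The intra-list similarity metric mentioned above is simply the sum of pairwise similarities over all unordered pairs in the list: lower values mean a more diverse list. A minimal sketch with an illustrative cosine similarity (the paper uses a taxonomy-based similarity; the data here is made up):

```python
import numpy as np

def intra_list_similarity(items, sim):
    """Sum of pairwise similarities over all unordered item pairs.
    Lower values indicate a more topically diverse list."""
    return sum(sim(items[i], items[j])
               for i in range(len(items))
               for j in range(i + 1, len(items)))

def cos(a, b):
    """Illustrative similarity: cosine between feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

redundant = [np.array([1.0, 0.0])] * 3                       # three identical items
diverse = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
           np.array([1.0, 1.0])]
ils_redundant = intra_list_similarity(redundant, cos)
ils_diverse = intra_list_similarity(diverse, cos)
```

Topic diversification then greedily re-orders candidates so that each added item keeps this quantity low.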
Conference Paper
Full-text available
In this paper, we propose a novel ranking scheme named Affinity Ranking (AR) to re-rank search results by optimizing two metrics: (1) diversity -- which indicates the variance of topics in a group of documents; (2) information richness -- which measures the coverage of a single document to its topic. Both of the two metrics are calculated from a directed link graph named Affinity Graph (AG). AG models the structure of a group of documents based on the asymmetric content similarities between each pair of documents. Experimental results in Yahoo! Directory, ODP Data, and Newsgroup data demonstrate that our proposed ranking algorithm significantly improves the search performance. Specifically, the algorithm achieves 31% improvement in diversity and 12% improvement in information richness relatively within the top 10 search results.
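The link-analysis flavor of the information-richness score can be sketched as a PageRank-style power iteration over a row-normalized affinity graph: documents similar to many well-scored documents score highly. This is a generic approximation of that idea, not the exact AR formulation; the damping value and the toy graph are illustrative:

```python
import numpy as np

def affinity_rank(A, damping=0.85, iters=100):
    """Score nodes of an affinity graph by damped power iteration
    (PageRank-style) on the row-normalized transition matrix."""
    n = len(A)
    M = A / A.sum(axis=1, keepdims=True)   # row-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M.T @ r)
    return r / r.sum()

# Toy affinity graph: document 0 is strongly similar to all the others,
# so it should receive the highest information-richness score.
A = np.array([[0.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 0.1, 0.1],
              [1.0, 0.1, 0.0, 0.1],
              [1.0, 0.1, 0.1, 0.0]])
scores = affinity_rank(A)
```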
Article
Full-text available
We present several new variations on the theme of nonnegative matrix factorization (NMF). Considering factorizations of the form X = FG^T, we focus on algorithms in which G is restricted to containing nonnegative entries, but allow the data matrix X to have mixed signs, thus extending the applicable range of NMF methods. We also consider algorithms in which the basis vectors of F are constrained to be convex combinations of the data points. This is used for a kernel extension of NMF. We provide algorithms for computing these new factorizations together with supporting theoretical analysis. We also analyze the relationships between our algorithms and clustering algorithms, and consider the implications for sparseness of solutions. Finally, we present experimental results that explore the properties of these new methods.
Article
Full-text available
In this paper, we have developed a novel scheme to incorporate topic network and representativeness-based sampling for achieving semantic and visual summarization and visualization of large-scale collections of Flickr images. First, topic network is automatically generated for summarizing and visualizing large-scale collections of Flickr images at a semantic level, so that users can select more suitable keywords for more precise query formulation. Second, the diverse visual similarities between the semantically-similar images are characterized more precisely by using a mixture-of-kernels and a representativeness-based image sampling algorithm is developed to achieve similarity-based summarization and visualization of large amounts of images under the same topic, so that users can find some particular images of interest more effectively. Our experiments on large-scale image collections with diverse semantics have provided very positive results.
Conference Paper
Full-text available
The recent revolution of digital camera technology has resulted in much larger collections of images. Image browsing techniques thus become increasingly important for overview and retrieval of images in sizable collections. This paper proposes CAT (clustered album thumbnail), a technique for browsing large image collections, and its interface for controlling the level of details (LOD). As a preprocessing, this new system applies tree-structured clustering to images based on their keywords and pixel values, and selects representative images for each cluster. When a user specifies one or multiple keywords, CAT extracts a branch of the tree structure that contains clusters described by the user-specified keywords. A hierarchical data visualization technique is developed to display the tree structured organization of images using nested rectangular regions. Interlocked to the zooming operation, CAT selectively shows representative images while zooming out, or individual images while zooming in.
Article
Full-text available
We define a new distance measure, the resistor-average distance, between two probability distributions; it is closely related to the Kullback-Leibler distance. While the Kullback-Leibler distance is asymmetric in the two distributions, the resistor-average distance is not.
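The resistor-average distance combines the two asymmetric Kullback-Leibler divergences the way parallel resistors combine: 1/R(p, q) = 1/D(p||q) + 1/D(q||p), which is symmetric by construction. A minimal numpy sketch for discrete distributions under that formula (names are ours):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def resistor_average(p, q):
    """Resistor-average distance: 1/R = 1/D(p||q) + 1/D(q||p),
    i.e. the 'parallel resistance' of the two KL divergences."""
    d1, d2 = kl(p, q), kl(q, p)
    if d1 == 0.0 and d2 == 0.0:
        return 0.0
    return d1 * d2 / (d1 + d2)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
```

Note that R is symmetric and never exceeds either of the two one-sided divergences, just as a parallel resistance never exceeds either branch.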
Article
Full-text available
Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
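The least-squares variant of these multiplicative updates is compact enough to state concretely: each factor is scaled elementwise by a ratio that never increases the Frobenius error. A minimal numpy sketch (random initialization and a fixed iteration count are simplistic placeholders for a real stopping rule):

```python
import numpy as np

def nmf(X, r, iters=500, seed=0):
    """Lee-Seung multiplicative updates for X ≈ WH minimizing the
    Frobenius error. Each elementwise update preserves nonnegativity
    and does not increase the reconstruction error."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    eps = 1e-12                      # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy demo: exact nonnegative rank-3 data should be reconstructed well.
rng = np.random.default_rng(3)
X = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(X, 3)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The Kullback-Leibler variant analyzed in the paper only changes the multiplicative factor, not the overall structure of the loop.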
Article
With the proliferation of digital cameras and mobile devices, people are taking many more photos than ever before. However, these photos can be redundant in content and varied in quality, so there is a growing need for tools to manage photo collections. One efficient approach is photo collection summarization, which segments the collection into different events and then selects a set of representative, high-quality photos (key photos) from those events. However, existing photo collection summarization methods mainly consider low-level features for photo representation, such as color and texture, while ignoring other useful features such as high-level semantics and location. Moreover, they often return fixed summarization results, which provide little flexibility. In this paper, we propose a multi-modal and multi-scale photo collection summarization method leveraging multi-modal features, including time, location, and high-level semantic features. We first use a Gaussian mixture model to segment the photo collection into events. With images represented by these multi-modal features, our event segmentation algorithm performs better, since the multi-modal features can better capture the inhomogeneous structure of events. Next, we propose a novel key photo ranking and selection algorithm to select representative and high-quality photos from the events for summarization. Our key photo ranking algorithm takes the importance of both events and photos into consideration. Furthermore, our photo summarization method allows users to control the scale of event segmentation and the number of key photos selected. We evaluate our method through extensive experiments on four photo collections. Experimental results demonstrate that our method achieves better performance than previous photo collection summarization methods.
Article
In this paper we propose a novel approach to selecting a summary set of images from a large image collection using improved Random Sample Consensus (RANSAC) and Affinity Propagation (AP) clustering. It automatically selects a small set of representatives that highlight all the significant visual properties of a given image collection. The proposed framework comprises four main stages. First, the scale-invariant features of each image are extracted by the Scale Invariant Feature Transform (SIFT). Second, keypoints of two images are matched and ranked based on the nearest neighbor ratio, and the representative dataset for RANSAC is established from a minimal number of optimal matches. Third, the target homography matrix is fitted on the representative dataset, and mismatches are filtered out via the homography matrix. Finally, summarization is formulated as an optimization problem solved by AP clustering. We conduct experiments on a Paris dataset consisting of 1000 images downloaded from Flickr. The results show that the proposed approach significantly outperforms other methods.
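The nearest-neighbor-ratio ranking used in the second stage can be sketched on raw descriptor arrays: a keypoint match is kept only when the nearest descriptor in the other image is clearly closer than the second-nearest. This is a generic Lowe-style ratio test in numpy, with toy vectors standing in for SIFT descriptors (the threshold and all names are illustrative):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Keep a match (i, j) only when desc2[j] is the nearest neighbor of
    desc1[i] and its distance is below `ratio` times the distance to the
    second-nearest neighbor."""
    d = np.linalg.norm(desc1[:, None] - desc2[None, :], axis=-1)
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc1))
    keep = d[rows, nearest] < ratio * d[rows, second]
    return [(int(i), int(nearest[i])) for i in np.nonzero(keep)[0]]

rng = np.random.default_rng(4)
desc2 = rng.standard_normal((6, 8))
# Image-1 descriptors: noisy copies of words 0 and 3, plus a distant outlier.
desc1 = np.vstack([desc2[0] + 0.01 * rng.standard_normal(8),
                   desc2[3] + 0.01 * rng.standard_normal(8),
                   10.0 * rng.standard_normal(8)])
matches = ratio_test_matches(desc1, desc2)
```

The surviving matches would then seed the RANSAC homography fit of the third stage.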
Article
Summarization is desirable for efficiently apprehending the gist of huge amounts of data, and has become a significant challenge in many applications such as news article summarization and social media mining. Considering that summaries drawn from multiple documents on one topic can describe various aspects of that topic, this paper attempts to exploit appropriate priors to generate topic aspect-oriented summarization (abbreviated as TAOS). The underlying intuition of the proposed TAOS is that different topics can prefer different aspects, and the different aspects can be represented by different preferences over features (e.g., a technical topic may prefer proper nouns more than a sports topic does). To realize this intuition, we first extract several groups of features according to topic factors, and then a group norm penalty and latent variables are utilized to select overlapping groups of features. We compare our proposed approach with state-of-the-art methods on the DUC2003 and DUC2004 datasets for text summarization and the NUS-WIDE dataset for image summarization. The results show our method can generate meaningful summaries in terms of ROUGE and Jensen-Shannon divergence metrics.
Article
This article studies the problem of latent community topic analysis in text-associated graphs. With the development of social media, a lot of user-generated content is available together with user networks, so user graphs can be extended with text information associated with nodes. Topic modeling is a classic problem in text mining, and it is interesting to discover the latent topics in text-associated graphs. Unlike traditional topic modeling methods that merely consider links, we incorporate community discovery into topic analysis in text-associated graphs to guarantee topical coherence in the communities, so that users in the same community are closely linked to each other and share common latent topics. We handle topic modeling and community discovery in the same framework. In our model, we separate the concepts of community and topic, so one community can correspond to multiple topics and multiple communities can share the same topic. We compare different methods and perform extensive experiments on two real datasets. The results confirm our hypothesis that topics can help understand community structure, while community structure can help model topics.
Article
The aim of this study was to analyze the radiological features of adolescent idiopathic cervical kyphosis; there are few previous reports on radiographic analysis of cervical sagittal alignment in this condition. A new method is proposed in this article to evaluate the severity of cervical kyphosis. 41 adolescent patients with cervical kyphosis were reviewed. Several angles were measured from the radiographs using the two-line Cobb method and the Harrison posterior tangent method; Ishihara's Curvature Index (CI), the Kyphosis Index (KI), kyphosis levels, and the apex of the kyphosis were also measured. The results showed that the apex of the kyphosis was located at the posterior-superior edge of C4 (70.7%) or C5 (29.3%). C2-C7 angles ranged from 4.7° to 71.3° (36.2°±13.6°) and from 9.8° to 83.1° (36.4°±15.1°) in the two methods, respectively. Local angles of the kyphotic area ranged from 21.8° to 96.3° (50.5°±23.7°) with the two-line Cobb method and from 19.8° to 105.6° (52.0°±19.5°) with the Harrison posterior tangent method. CI and KI ranged from 8.6 to 79.8 (36.8±16.7) and from 15.2 to 141.9 (50.6±23.7), respectively. Statistical analysis showed a significant positive correlation between KI and kyphosis angle. In adolescent idiopathic cervical kyphosis, the alteration of the sagittal profile occurs only on part of the cervical alignment rather than the whole cervical spine, and the apex of the kyphosis is located at the posterior-superior edge of the vertebrae. It seems that KI can accurately depict the severity of cervical kyphosis.
Article
Modern medical information retrieval systems are paramount for managing the insurmountable quantities of clinical data. These systems empower health care experts in the diagnosis of patients and play an important role in the clinical decision process. However, the ever-growing heterogeneous information generated in medical environments poses several challenges for retrieval systems. We propose a medical information retrieval system with support for multimodal medical case-based retrieval. The system supports medical information discovery by providing multimodal search, through a novel data fusion algorithm, and term suggestions from a medical thesaurus. Our search system compared favorably to other systems at ImageCLEFMedical 2013.
Article
The explosive growth and widespread accessibility of community-contributed multimedia content on the Internet have led to surging research activity in social image search. However, existing tag-based search methods frequently return irrelevant or redundant results. To quickly target the user's intention in the results returned by an ambiguous query, we first propose that the top-ranked search results should meet three criteria: relevance, typicality and diversity. With these criteria, a novel ranking scheme for social image search is proposed which incorporates both semantic similarity and visual similarity. A ranking list with relevance, typicality and diversity is returned by optimizing a measure named Average Diverse Precision. The typicality score of samples is estimated via the probability density in the space of visual features, and the diversity of the top-ranked list is achieved by fusing both semantic and visual similarities of images. A comprehensive approach to calculating visual similarity is considered by fusing similarity values from different features. To further improve ranking performance, a data-driven method is implemented to refine the tags of social images. Comprehensive experiments demonstrate the effectiveness of the proposed approach.
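The relevance-diversity trade-off described here can be illustrated with a greedy re-ranker. This is a simplified MMR-style sketch, not the paper's Average Diverse Precision optimisation; `diversified_rerank` and its `trade_off` parameter are assumptions made for illustration.

```python
import numpy as np

def diversified_rerank(relevance, sim, k=5, trade_off=0.7):
    """Greedy re-ranking in the spirit of relevance + diversity:
    each step picks the candidate with the best relevance score,
    penalised by its maximum similarity to the items already
    selected, so near-duplicates of chosen items are pushed down."""
    selected = [int(np.argmax(relevance))]
    candidates = set(range(len(relevance))) - set(selected)
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for c in candidates:
            redundancy = max(sim[c][s] for s in selected)
            score = trade_off * relevance[c] - (1 - trade_off) * redundancy
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate highly relevant items and one dissimilar moderately relevant item, the re-ranker skips the duplicate in favour of the dissimilar item, which is the behaviour the abstract's diversity criterion asks for.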
Article
A visualization system based on formal concept analysis is proposed to represent the lattice structure of a large-scale image database. In the proposed system, the HSB color space, which is well suited to human perception, is employed to draw the lattice structure of images so that users can easily perceive their overview and relationships. Moreover, a new image feature considering the effect of saturation and brightness is proposed. The system is developed using the JAVA-based programming language Processing on a computer (CPU = 2.13 GHz, MM = 2 GB), and a visualization experiment is performed. An experiment using 1,000 images selected from the Corel Image Gallery confirms the effectiveness of the proposed system for grasping the structure in perspective and discovering new relationships.
Article
One of the main issues confronting visualization is how to effectively display large, high-dimensional datasets within a limited display area without overwhelming the user. In this report, we discuss a data summarization approach to tackle this problem. Summarization is the process by which data is reduced, in a meaningful and intelligent fashion, to its important and relevant features. We survey several different techniques from within computer science which can be used to extract various characteristics from raw data. Using summarization techniques intelligently within visualization systems could potentially reduce the size and dimensionality of large, high-dimensional data, highlight relevant and important features, and enhance comprehension.
Article
In this paper, a novel approach is developed to achieve automatic image collection summarization. The effectiveness of a summary is reflected by its ability to reconstruct the original set or each individual image in it. We leverage the dictionary learning for sparse representation model to construct the summary and to represent the images. Specifically, we reformulate summarization as a dictionary learning problem: selecting bases which can be sparsely combined to represent the original images while achieving a minimum global reconstruction error, such as MSE (Mean Square Error). The resulting "Sparse Least Square" problem is NP-hard, so a simulated annealing algorithm is adopted to learn such a dictionary, or image summary, by minimizing the proposed objective function. A quantitative measure is defined for assessing the quality of an image summary by investigating both its reconstruction ability and its representativeness of a large original image set. We also compare the performance of our image summarization approach with that of six other baseline summarization tools on multiple image sets (ImageNet, NUS-WIDE-SCENE and Event image set). Our experimental results show that the proposed dictionary-learning approach obtains more accurate results than the six baseline summarization algorithms.
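The "select bases to minimise global reconstruction error via simulated annealing" idea can be sketched directly. This is a toy version, not the authors' implementation: images are plain feature vectors, the reconstruction is unconstrained least squares rather than a sparse code, and `anneal_summary` uses a simple linear cooling schedule.

```python
import numpy as np

def reconstruction_error(X, subset):
    """MSE when every image (row of X) is reconstructed as a
    least-squares combination of the chosen summary images."""
    B = X[subset].T                                   # d x k basis
    coef, *_ = np.linalg.lstsq(B, X.T, rcond=None)
    return float(np.mean((B @ coef - X.T) ** 2))

def anneal_summary(X, k, n_steps=300, t0=1.0, seed=0):
    """Toy simulated annealing over size-k summaries: propose swapping
    one summary image for an outside image, accept worse proposals
    with a temperature-controlled probability, and keep the best
    subset ever visited."""
    rng = np.random.default_rng(seed)
    n = len(X)
    current = list(rng.choice(n, size=k, replace=False))
    cur_err = reconstruction_error(X, current)
    best, best_err = current, cur_err
    for step in range(n_steps):
        temp = t0 * (1.0 - step / n_steps) + 1e-9     # linear cooling
        new_img = int(rng.integers(n))
        if new_img in current:
            continue
        proposal = current.copy()
        proposal[rng.integers(k)] = new_img
        err = reconstruction_error(X, proposal)
        if err < cur_err or rng.random() < np.exp((cur_err - err) / temp):
            current, cur_err = proposal, err
            if err < best_err:
                best, best_err = proposal, err
    return sorted(best), best_err
```

On a collection whose images fall along a few distinct directions, a summary holding one image per direction reconstructs the whole set almost exactly, which is the quality criterion the abstract defines.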
Article
Progress in content-based image retrieval has reactivated research on image analysis, and a number of similarity-based methods have been established to assess the similarity between images. In this paper, the content-based approach is extended to the problems of image collection summarization and comparison. For these purposes we propose to carry out clustering analysis on visual features using self-organizing maps, and then evaluate collection similarity using a few dissimilarity measures implemented on the feature maps. The effectiveness of these dissimilarity measures is then examined in an empirical study.
Article
Many image retrieval users are concerned about the diversity of the retrieval results, as well as their relevance. In this paper, we develop a post-processing system, based on affinity propagation clustering on manifolds, to improve the diversity of the retrieval results without reducing their relevance. In order to obtain top-20 outputs (usually only the top 20 results are shown to users) containing diverse items representing different sub-topics, a modified affinity propagation clustering on manifolds, whose parameters are optimized by minimizing the Davies-Bouldin criterion, is proposed and performed on the top hundreds of output images of a preceding support vector machine (SVM) system. Finally, after obtaining the clusters, we diversify the top retrieval results by putting the image with the lowest rank in each cluster at the top of the answer list. We test the proposed system on the ImageCLEF PHOTO 2008 task. The experimental results show that our method performs better at enhancing the diversity of the image retrieval results compared to other diversifying methods such as K-means, Quality Threshold (QT), date clustering (ClusterDMY), and so on. Furthermore, our method does not lead to any loss of relevance in the retrieval results.
Article
Thousands of images are generated every day, which implies the necessity to classify, organise and access them using an easy, faster and efficient way. Scene classification, the classification of images into semantic categories (e.g. coast, mountains and streets), is a challenging and important problem nowadays. Many different approaches concerning scene classification have been proposed in the last few years. This article presents a detailed review of some of the most commonly used scene classification approaches. Furthermore, the surveyed techniques have been tested and their accuracy evaluated. Comparative results are shown and discussed giving the advantages and disadvantages of each methodology.
Conference Paper
In this paper, we address the problem of managing tagged images with hybrid summarization. We formulate this problem as finding a few image exemplars to represent the image set semantically and visually, and solve it in a hybrid way by exploiting both visual and textual information associated with images. We propose a novel approach, called Homogeneous and Heterogeneous Message Propagation (H2MP), which extends affinity propagation from homogeneous relations to heterogeneous relations. The summary obtained by our approach is both visually and semantically satisfactory. The experimental results demonstrate the effectiveness and efficiency of the proposed approach.
Conference Paper
In most well-known image retrieval test sets, the imagery typically cannot be freely distributed or is not representative of a large community of users. In this paper we present a collection for the MIR community comprising 25000 images from the Flickr website which are redistributable for research purposes and represent a real community of users both in the image content and image tags. We have extracted the tags and EXIF image metadata, and also make all of these publicly available. In addition we discuss several challenges for benchmarking retrieval and classification methods.
Conference Paper
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
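The factorise-then-argmax clustering described here is compact enough to sketch. A minimal sketch using the classical Lee-Seung multiplicative updates for the Frobenius objective (one common choice of update rule; the abstract does not fix it), with `nmf_cluster` as an illustrative name.

```python
import numpy as np

def nmf_cluster(V, k, n_iters=200, seed=0):
    """Minimal NMF document clustering: factorise the term-document
    matrix V (terms x docs) as W @ H with multiplicative updates,
    then assign each document to the base topic (row of H) on which
    it has the largest coefficient."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 0.1          # term-by-topic factor
    H = rng.random((k, n)) + 0.1          # topic-by-document factor
    eps = 1e-9                            # avoids division by zero
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H, H.argmax(axis=0)
```

On a term-document matrix with two disjoint vocabulary blocks, the argmax over H recovers the two document groups, mirroring the easy cluster-membership derivation the abstract highlights.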
Conference Paper
Can we leverage the community-contributed collections of rich media on the web to automatically generate representative and diverse views of the world's landmarks? We use a combination of context- and content-based tools to generate representative sets of images for location-driven features and landmarks, a common search task. To do that, we use location and other metadata, as well as tags associated with images and the images' visual features. We present an approach to extracting tags that represent landmarks, and show how to use unsupervised methods to extract representative views and images for each landmark. This approach can potentially scale to provide better search and representation for landmarks worldwide. We evaluate the system in the context of image search using a real-life dataset of 110,000 images from the San Francisco area.
Article
Presenting and browsing image search results play key roles in helping users to find desired images in search results. Most existing commercial image search engines present them as a ranked list. However, such a scheme suffers from at least two drawbacks: it is inconvenient for users to get an overview of the whole result set, and finding desired images in the list carries a high cost. In this paper, we introduce a novel search result summarization approach and exploit it to further propose an interactive browsing scheme. The main contributions of this paper are: (1) a dynamic absorbing random walk to find diversified representatives for image search result summarization; (2) a locally scaled visual similarity evaluation scheme between two images that inspects the relation between each image and other images; and (3) an interactive browsing scheme, based on a tree structure organizing the images obtained from the summarization approach, which enables users to intuitively and conveniently browse the image search results. Quantitative experimental results and a user study demonstrate the effectiveness of the proposed summarization and browsing approaches.
Article
In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.
Article
This paper proposes the Flickr Distance (FD) to measure the visual correlation between concepts. For each concept, a collection of related images is obtained from the Flickr website. We assume that each concept consists of several states, e.g., different views, different semantics, etc., which are considered as latent topics. A latent topic visual language model (LTVLM) is then built to capture these states. The Flickr Distance between two concepts is defined as the Jensen-Shannon (J-S) divergence between their LTVLMs. Unlike traditional conceptual distance measurements, which are based on Web textual documents, FD is based on visual information. Compared with the WordNet distance, FD can easily scale up with the increasing size of the conceptual corpus. Compared with the Normalized Google Distance (NGD) and Tag Concurrence Distance (TCD), FD uses visual information and can properly measure conceptual relations. We apply FD to multimedia-related tasks and find that methods based on FD significantly outperform those based on NGD and TCD. With the FD measurement, we also construct a large-scale visual conceptual network (VCNet) to store the knowledge of conceptual relationships. Experiments show that FD is more consistent with human cognition and also outperforms text-based distances in real-world applications.
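The Jensen-Shannon divergence at the core of the Flickr Distance is easy to state concretely. A small sketch using base-2 logarithms, so the value is bounded by 1 (computing the LTVLM topic distributions themselves is out of scope here):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions:
    the average KL divergence of each distribution to their mixture.
    Symmetric and, with log base 2, bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()       # normalise defensively
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # 0 * log(0) := 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Two disjoint topic distributions sit at the maximum distance of 1, identical distributions at 0, which is what makes the measure usable as a conceptual distance.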
Article
Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
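The message-passing scheme summarised above fits in a short function. This is a compact sketch with uniform damping; the responsibility and availability updates follow the standard affinity propagation equations, with the "preference" stored on the diagonal of the similarity matrix, but it is an illustration rather than the authors' reference code.

```python
import numpy as np

def affinity_propagation(S, n_iters=100, damping=0.5):
    """Compact affinity propagation over a similarity matrix S whose
    diagonal holds each point's 'preference' to be an exemplar.
    Returns the exemplar indices and a per-point exemplar label."""
    n = S.shape[0]
    R = np.zeros((n, n))                       # responsibilities
    A = np.zeros((n, n))                       # availabilities
    for _ in range(n_iters):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) accumulates evidence that k is a good exemplar
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        np.fill_diagonal(A_new, col - Rp.diagonal())
        A = damping * A + (1 - damping) * A_new
    exemplars = np.where(R.diagonal() + A.diagonal() > 0)[0]
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars              # exemplars label themselves
    return exemplars, labels
```

On two well-separated 1D clusters with the median similarity as the preference, the messages converge to one exemplar per cluster, matching the behaviour described in the abstract.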
Conference Paper
This paper considers the problem of selecting iconic images to summarize general visual categories. We define iconic images as high-quality representatives of a large group of images consistent both in appearance and semantics. To find such groups, we perform joint clustering in the space of global image descriptors and latent topic vectors of tags associated with the images. To select the representative iconic images for the joint clusters, we use a quality ranking learned from a large collection of labeled images. For the purposes of visualization, iconic images are grouped by semantic "theme" and multidimensional scaling is used to compute a 2D layout that reflects the relationships between the themes. Results on four large-scale datasets demonstrate the ability of our approach to discover plausible themes and recurring visual motifs for challenging abstract concepts such as "love" and "beauty".
Conference Paper
We formulate the problem of scene summarization as selecting a set of images that efficiently represents the visual content of a given scene. The ideal summary presents the most interesting and important aspects of the scene with minimal redundancy. We propose a solution to this problem using multi-user image collections from the Internet. Our solution examines the distribution of images in the collection to select a set of canonical views to form the scene summary, using clustering techniques on visual features. The summaries we compute also lend themselves naturally to the browsing of image collections, and can be augmented by analyzing user-specified image tag data. We demonstrate the approach using a collection of images of the city of Rome, showing the ability to automatically decompose the images into separate scenes, and identify canonical views for each scene.
Conference Paper
A lattice structure visualization method based on formal concept analysis for huge image databases is proposed, together with a summarization method for such databases based on the obtained lattice structure. The proposed algorithm first generates predictive frames from the original frames and divides them into blocks of suitable size. Then the standard deviation of each block is calculated and an information table is constructed, where the objects and attributes correspond to frames and the absolute means of the pixels in each block, respectively. A concept lattice for the information table is obtained by formal concept analysis, which helps in understanding an overview of the image database. The obtained lattice includes redundant elements, and summarized information is obtained by eliminating these redundant elements from the concept lattice. An experiment using the CAVIAR (context aware vision using image-based active recognition) database confirms the effectiveness of the proposed method.
Article
The advent of large image databases (>10000 images) has created a need for tools which can search and organize images automatically by their content. This paper focuses on the use of hierarchical tree structures both to speed up search-by-query and to organize databases for effective browsing. The first part of the paper develops a fast search algorithm based on best-first branch and bound search. This algorithm is designed so that speed and accuracy may be continuously traded off through the selection of a parameter λ. We find that the algorithm is most effective when used to perform an approximate search, where it can typically reduce computation by a factor of 20-40 for accuracies ranging from 80% to 90%. We then present a method for designing a hierarchical browsing environment which we call a similarity pyramid. The similarity pyramid groups similar images together while allowing users to view the database at varying levels of resolution. We show that the similarity pyramid is best constructed using agglomerative (bottom-up) clustering methods, and present a fast sparse clustering method which dramatically reduces both memory and computation over conventional methods.