Conference Paper

Leveraging high-level and low-level features for multimedia event detection


Abstract

This paper addresses the challenge of Multimedia Event Detection by proposing a novel method for fusing high-level and low-level features based on collective classification. The method consists of three steps: training a classifier from low-level features; encoding high-level features into graphs; and diffusing the scores over the constructed graphs to obtain the final prediction. The final prediction is derived from multiple graphs, each of which corresponds to one high-level feature. The paper investigates two graph construction methods, using logarithmic and exponential loss functions respectively, and two collective classification algorithms, namely Gibbs sampling and Markov random walk. Theoretical analysis demonstrates that the proposed method converges and is computationally scalable, and empirical analysis on the TRECVID 2011 Multimedia Event Detection dataset validates its outstanding performance compared to state-of-the-art methods, with the added benefit of interpretability.
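For intuition, the fusion scheme described above can be sketched as follows: a classifier trained on low-level features supplies initial scores, a graph is built from one high-level feature, and the scores are diffused over that graph in the random-walk style; the final prediction would then combine the diffused scores from the graphs of all high-level features. This is only a minimal sketch, not the authors' implementation; the Gaussian affinity, the mixing parameter alpha, and the iteration count are assumptions.

```python
import numpy as np

def build_graph(high_level_feats, sigma=1.0):
    """Row-stochastic affinity matrix built from one high-level feature (illustrative)."""
    d2 = ((high_level_feats[:, None, :] - high_level_feats[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                      # no self-loops
    return W / W.sum(axis=1, keepdims=True)       # transition matrix P

def diffuse_scores(P, init_scores, alpha=0.8, n_iter=50):
    """Markov-random-walk style diffusion: mix propagated scores with the initial ones."""
    s = init_scores.copy()
    for _ in range(n_iter):
        s = alpha * P.dot(s) + (1 - alpha) * init_scores
    return s

# toy usage: 5 videos, a 3-dimensional concept (high-level) feature, and
# initial detection scores from a low-level classifier
rng = np.random.default_rng(0)
concepts = rng.random((5, 3))
low_level_scores = rng.random(5)
final_scores = diffuse_scores(build_graph(concepts), low_level_scores)
```

The loop converges to the usual label-propagation fixed point, so diffusing separately over one graph per high-level feature and averaging the results gives one way to picture the multi-graph prediction described in the abstract.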


... This is because conventional optical sensors, such as charge-coupled device (CCD) or complementary metal-oxide semiconductor (CMOS) sensors, can only capture the intensity of the light, which is proportional to the number of photons. Moreover, measuring the phase of the electromagnetic field at ∼10^15 Hz is nearly impossible, as no electronic device can easily follow such a high frequency. ...
... As is presented in [14], a complete image has high redundancy, and removing a substantial proportion of its pixels actually makes it even easier to extract the high-level features, with higher accuracy and fewer iterations in the training process. High-level features consider the probability of observing different concepts in the image, which can be easily recognized by humans [15]. Meanwhile, low-level features, or pixel-wise features, which capture the local appearance and texture statistics based on the pixels [15], are critical to the quality of the image reconstruction. ...
... High-level features consider the probability of observing different concepts in the image, which can be easily recognized by humans [15]. Meanwhile, low-level features, or pixel-wise features, which capture the local appearance and texture statistics based on the pixels [15], are critical to the quality of the image reconstruction. Besides, Guerrero et al. [16] provide theoretical guarantees for phase retrieval from coded diffraction patterns with equi-spaced masks, including binary masks. ...
Article
Full-text available
As an important inverse imaging problem in diffraction optics, Fourier phase retrieval aims at estimating the latent image of the target object only from the magnitude of its Fourier measurement. Although in real applications alternating methods are widely-used for Fourier phase retrieval considering the constraints in the object and Fourier domains, they need a lot of initial guesses and iterations to achieve reasonable results. In this paper, we show that a proper sensor mask directly attached to the Fourier magnitude can improve the efficiency of the iterative phase retrieval algorithms, such as alternating direction method of multipliers (ADMM). Furthermore, we refer to the learning-based method to determine the sensor mask according to the Fourier measurement, and unrolled ADMM is used for phase retrieval. Numerical results show that our method outperforms other existing methods for the Fourier phase retrieval problem.
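As a point of reference for the "alternating methods" mentioned in this abstract, below is a classic error-reduction (Gerchberg-Saxton-style) iteration for Fourier phase retrieval. It is a generic textbook baseline, not the paper's learned sensor mask or unrolled-ADMM approach; the support/non-negativity constraints, iteration count, and toy image are assumptions.

```python
import numpy as np

def error_reduction(fourier_mag, support, n_iter=200, seed=0):
    """Classic alternating-projection phase retrieval (illustrative baseline only)."""
    rng = np.random.default_rng(seed)
    x = rng.random(support.shape)                    # random initial guess
    for _ in range(n_iter):
        F = np.fft.fft2(x)
        F = fourier_mag * np.exp(1j * np.angle(F))   # enforce the measured magnitude
        x = np.real(np.fft.ifft2(F))
        x = np.clip(x, 0, None) * support            # enforce non-negativity and support
    return x

# toy usage: recover a small binary object from its Fourier magnitude
img = np.zeros((32, 32)); img[12:20, 10:22] = 1.0
support = np.zeros_like(img); support[8:24, 6:26] = 1.0   # loose support box (assumed known)
rec = error_reduction(np.abs(np.fft.fft2(img)), support)
```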
... Generally, these features can be divided into two categories, namely high-level and low-level features. Low-level features capture the local appearance and texture statistics of objects in the video at particular interest points, while high-level features are represented by a real number estimating the probability of observing a concept in the video [19]. Different features characterize different aspects of the multimedia data. ...
... Although high-performance feature descriptors have been developed to help characterize videos, it is still difficult to obtain enough of the required information with a single feature to discriminate between different kinds of complex events. Therefore, it is widely agreed that combining multiple types of features or video sources achieves better performance [7,18,19,26,32-34,44]. For example, in [7] Chang et al. proposed to investigate the varying contribution of semantic representations from different image/video sources, thus enhancing the exploitation of semantic representation at the source level. ...
... In [34], Tang et al. present a method which is able to select different subsets of features to combine for certain classes. Jiang et al. use a graph-based approach in [19] to diffuse scores among different video data, which makes the fusion result interpretable for humans. In [6] Chang et al. present a multiple feature learning method which embeds feature interaction into a joint framework to capture the nonlinear property within the data while simultaneously combining the linear and nonlinear effects. ...
Article
Full-text available
Multimedia event detection (MED) has become one of the most important visual content analysis tools with the rapid growth of user-generated videos on the Internet. Generally, multimedia data is represented by multiple features, and it is difficult to achieve good performance for complex event detection with only a single feature. However, how to fuse different features effectively is the crucial problem for MED with multiple features. Meanwhile, exploiting multiple features simultaneously in large-scale scenarios always produces a heavy computational burden. To address these two issues, we propose a self-adaptive multi-feature learning framework with an efficient Support Vector Machine (SVM) solver for complex event detection in this paper. Our model is able to utilize multiple features reasonably with an adaptively weighted linear combination, which is simple yet effective, according to the varying impact that different features have on a specific event. In order to mitigate the expensive computational cost, we employ a fast primal SVM solver in the proposed alternating optimization algorithm to obtain an approximate solution with a gradient descent method. Extensive experimental results on the standard TRECVID MEDTest 2013 and 2014 datasets demonstrate the effectiveness and superiority of the proposed framework for complex event detection.
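The adaptively weighted linear combination of per-feature scores can be pictured with a small sketch like the one below, which learns convex fusion weights by projected gradient descent on a squared loss over held-out labels. This is illustrative only: the paper's actual method alternates between a fast primal SVM solver and weight updates, and the simplex projection, learning rate, and loss used here are assumptions.

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection onto the probability simplex (keeps weights convex)."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(w)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(w + theta, 0)

def learn_fusion_weights(scores, labels, lr=0.1, n_iter=200):
    """scores: (n_samples, n_features) per-feature detector outputs; labels in {0, 1}."""
    n_feat = scores.shape[1]
    w = np.full(n_feat, 1.0 / n_feat)                     # start from uniform weights
    for _ in range(n_iter):
        pred = scores @ w
        grad = scores.T @ (pred - labels) / len(labels)   # gradient of the squared loss
        w = project_simplex(w - lr * grad)
    return w
```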
... [Fragment of a flattened table of event definitions from this survey, listing Hakeem et al. [84], SanMiguel [169], Jiang et al. [94], Tong et al. [188], Jiang et al. [97], and Over et al. [145] together with definitions such as "a list of interactions between objects using any prior information concerning the context of a scene"; "a number of human actions, processes, and activities (loosely or tightly organized) having temporal and semantic relationships to the overarching activity" (examples: "changing a vehicle tire", "making a cake", "attempting a bike trick"); "a complex activity occurring at a specific place and time involving people interacting with other people or object(s)"; and "any event (something happening at a specific time and place) of interest to the (news) media" (examples: "demonstration", "speech", "concert").] ...
... In [169], SanMiguel et al. define an event as a list of interactions between objects, which, along with any other prior information concerning the context of a scene (where the event evolves), are used for the problem of video surveillance. Similarly, in [94], an event is an activity-centered happening that involves people engaged in process-driven actions with other people and/or objects at a specific place and time. Consequently, the above definitions place the notion of the event at the boundary between the video event detection and vision-based action recognition [154] problems. ...
... [Fragment of a survey table (with headers such as "Learning from text" and "Audio Event Detection") summarizing features and models: for audio event detection, audio features such as MFCC [156], ZCR [156], LPC [8], LPCC [211], PLP [59] and spectral power [59], and models such as HMM [10], EP-HMM [19], GMM [37], BN [22] and SVM [22]; for visual event detection, low-level visual features (static [119], motion [199]), high-level features (concept detectors [132], DCNN-based [213]), audio [47] and textual [96] features, feature encodings (BoW [26], FVs [168], VLAD [163]), standard SVM [83], multimodal fusion (early [186], late [94]), temporal structure exploitation [82], MKL [193], learning-based hashing [222], dynamic pooling [115], and learning from related samples (SVM, RD-SVM, RD-KSVM-iGSU [192]; cross-feature learning [212]; fine-grained labeling [124]).] ...
Article
Full-text available
Research on event-based processing and analysis of media is receiving increasing attention from the scientific community due to its relevance for an abundance of applications, from consumer video management and video surveillance to lifelogging and social media. Events have the ability to semantically encode relationships between different informational modalities, such as visual-audio-text, time, and involved agents and objects, with the spatio-temporal component of events being a key feature for contextual analysis. This unveils an enormous potential for exploiting new information sources and opening new research directions. In this paper, we survey the existing literature in this field. We extensively review the employed conceptualizations of the notion of event in multimedia, the techniques for event representation and modeling, and the feature representation and event inference approaches for the problems of event detection in audio, visual, and textual content. Furthermore, we review some key event-based multimedia applications, and various benchmarking activities that provide solid frameworks for measuring the performance of different event processing and analysis systems. We provide an in-depth discussion of the insights obtained from reviewing the literature and identify future directions and challenges.
... Many of these recent approaches introduce the use of a bank of concepts representation for this task [8], [4], [10], [11]. Studies [12], [7], [4] have demonstrated that some concepts are more useful than others in recognizing certain events. However, the key question is which concepts are most useful in identifying certain events and how to identify the concepts within the video stream. ...
... However, the key question is which concepts are most useful in identifying certain events and how to identify those concepts within the video stream. Current domain-free approaches learn the event-discriminative concepts from video examples [7], [5], [13]. When a sufficient number of strongly labelled training samples is provided, these learning methods are able to deliver promising results. ...
... It seems that the dense trajectory feature [23] is the single best feature, and the other visual features complement each other. Current methods for constructing an informative concept bank for different events can be categorized as: (a) automatically selecting the distinguishing high-level features from video examples [7], [5], [13]; and (b) manually selecting the related high-level concepts [6], [11]. In the first approach, the selection is based solely on the few positive video examples for a complex event, and thus may not conform to humans' broad understanding of the event. ...
Article
The task of recognizing events from video has attracted a lot of attention in recent years. However, due to the complex nature of user-defined events, the use of purely audio-visual content analysis without domain knowledge has been found to be grossly inadequate. In this paper, we propose to construct a semantic-visual knowledge base to encode the rich event-centric concepts and their relationships from well-established lexical databases, including FrameNet, as well as the concept-specific visual knowledge from ImageNet. Based on this semantic-visual knowledge base, we design an effective system for video event recognition. Specifically, in order to narrow the semantic gap between high-level complex events and low-level visual representations, we utilize the event-centric semantic concepts encoded in the knowledge base as the intermediate-level event representation, which offers both human-perceivable and machine-interpretable semantic clues for event recognition. In addition, in order to leverage the abundant ImageNet images, we propose a robust transfer learning model to learn noise-resistant concept classifiers for videos. Extensive experiments on various real-world video datasets demonstrate the superiority of our proposed system as compared to state-of-the-art approaches.
... The reranking is inspired by the self-paced learning proposed by Jiang et al.9), in that the model is trained iteratively as opposed to simultaneously. Our methods are able to leverage high-level and low-level features, which generally leads to increased performance31). The high-level features used are ASR, OCR, and semantic visual concepts. ...
... The final results were computed by averaging the initial ranked list with the reranked list. This is beneficial because for the 000Ex case, the initial ranked list is from semantic search (high-level features), whereas the reranked list is from learning-based search (low-level features), and leveraging high-level and low-level features usually yields better performance31). To be prudent, the number of iterations is no more than 2 in our final submissions. ...
Article
The large number of user-generated videos uploaded onto the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, so these videos are unsearchable by current search engines. Therefore, content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training, and video search/reranking. We present novel strategies in these topics to enhance CBVR in both accuracy and speed under different query inputs, including pure textual queries and query by video examples. Our proposed strategies have been incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed other submissions on both text queries and video example queries, demonstrating the effectiveness of our proposed approaches.
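The average fusion of the semantic-search ranked list with the reranked list, described in the citing contexts above, amounts to something like the sketch below. The min-max normalization is an assumption, since the excerpt does not say how the two score lists are scaled before averaging.

```python
import numpy as np

def minmax(scores):
    """Rescale a score list to [0, 1] so that the two lists are comparable."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def average_fusion(semantic_scores, rerank_scores):
    """Average the normalized high-level (semantic) and low-level (reranked) scores,
    then return video indices sorted best-first."""
    fused = 0.5 * (minmax(semantic_scores) + minmax(rerank_scores))
    return np.argsort(-fused)
```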
... Multimedia event detection is an interesting problem. A number of studies have been proposed to tackle this problem using only a limited number of training examples (typically 10 or 100 examples) [3,9,11,14,19,35,39,41,43,49]. Generally, in a state-of-the-art system, the event classifiers are trained on low-level and high-level features, and the final decision is derived from the fusion of the individual classification results. ...
... The low-level features are then fed into off-the-shelf detectors to extract the high-level features. Each dimension of the high-level feature corresponds to a confidence score for detecting a semantic concept in the video [13,14]. Compared with low-level features, high-level features have a much lower dimension, which makes them economical for both storage and computation. ...
Article
Full-text available
Semantic search or text-to-video search in video is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text-to-text matching, in which the query words are matched against the user-generated metadata. This kind of text-to-text search, though simple, is of limited functionality as it provides no understanding about the video content. This paper presents a state-of-the-art system for event search without any user-generated metadata or example videos, known as text-to-video search. The system relies on substantial video content understanding and allows for searching complex events over a large collection of videos. The proposed text-to-video search can be used to augment the existing text-to-text search for video. The novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieves the best performance. We share our observations and lessons in building such a state-of-the-art system, which may be instrumental in guiding the design of the future system for video search and analysis.
... Most late fusion strategies focus on classifier-level fusion, which determines a fixed weight for all prediction scores of a specific classifier. Several recent works [21], [27], [28] determine sample-specific fusion weights. The idea in [21] and [27] is to propagate information from labeled examples to unlabeled ones along graphs built on low-level features. ...
... Several recent works [21], [27], [28] determine sample-specific fusion weights. The idea in [21] and [27] is to propagate information from labeled examples to unlabeled ones along graphs built on low-level features. Ye et al. [28] converted the prediction scores of multiple classifiers into pairwise relationship matrices, and proposed an optimization framework to determine the fusion score. ...
Article
Full-text available
We present a deep learning strategy to fuse multiple semantic cues for complex event recognition. In particular, we tackle the recognition task by answering how to jointly analyze human actions (who is doing what), objects (what), and scenes (where). First, each type of semantic feature (e.g., human action trajectories) is fed into a corresponding multi-layer feature abstraction pathway, followed by a fusion layer connecting all the different pathways. Second, the correlations of how the semantic cues interact with each other are learned in an unsupervised cross-modality autoencoder fashion. Finally, by fine-tuning a large-margin objective deployed on this deep architecture, we are able to answer the question of how the semantic cues of who, what, and where compose a complex event. As compared with traditional feature fusion methods (e.g., various early or late strategies), our method jointly learns the essential higher-level features that are most effective for fusion and recognition. We perform extensive experiments on two real-world complex event video benchmarks, MED'11 and CCV, and demonstrate that our method outperforms the best published results by 21% and 11%, respectively, on an event recognition task.
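The pathway-plus-fusion architecture described in this abstract can be sketched roughly as below (PyTorch). The layer sizes, the use of plain MLP pathways, and the three-cue setup are assumptions, and the cross-modality autoencoder pre-training and large-margin objective are omitted.

```python
import torch
import torch.nn as nn

class MultiCueFusion(nn.Module):
    """One abstraction pathway per semantic cue (e.g., action / object / scene features),
    followed by a fusion layer over the concatenated pathway outputs (sketch only)."""
    def __init__(self, cue_dims=(512, 256, 128), hidden=128, n_events=20):
        super().__init__()
        self.pathways = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
            for d in cue_dims])
        self.fusion = nn.Linear(hidden * len(cue_dims), n_events)

    def forward(self, cues):                      # cues: list of tensors, one per cue type
        h = [path(c) for path, c in zip(self.pathways, cues)]
        return self.fusion(torch.cat(h, dim=-1))  # event logits

# toy usage
model = MultiCueFusion()
logits = model([torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 128)])
```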
... The reranking is inspired by the self-paced learning proposed in [4], in that the model is trained iteratively as opposed to simultaneously. Our methods are able to leverage high-level and low-level features, which generally leads to increased performance [14]. The high-level features used are ASR, OCR, and semantic visual concepts. ...
... We did not run PRF for SQ since our 000Ex and SQ runs are very similar. The final run is the average fusion of the original ranked list and the reranked list to leverage high-level and low-level features, which, according to [14], usually yields better performance. To be prudent, the number of iterations is no more than 2 in our final submissions. ...
Conference Paper
Full-text available
We report on our system used in the TRECVID 2014 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. On the MED task, the CMU team achieved leading performance in the Semantic Query (SQ), 000Ex, 010Ex and 100Ex settings. Furthermore, the SQ and 000Ex runs are significantly better than the submissions from the other teams. We attribute the good performance to 4 main components: 1) our large-scale semantic concept detectors trained on video shots for the SQ/000Ex systems, 2) better features, such as improved trajectories and deep learning features, for the 010Ex/100Ex systems, 3) a novel Multistage Hybrid Late Fusion method for the 010Ex/100Ex systems, and 4) our reranking methods for Pseudo Relevance Feedback for the 000Ex/010Ex systems. On the MER task, our system utilizes a subset of features and detection results from the MED system, from which the recounting is then generated. Recounting evidence is presented by selecting the most likely concepts detected in the salient shots of a video. Salient shots are detected by searching for shots which have a high response when scored by the video-level event detector. On the MED task, the CMU team has enhanced the MED 2013 [1] system in multiple directions, and these improvements have enabled the system to achieve leading performance in the SQ (Semantic Query), 000Ex, 010Ex and 100Ex settings. Furthermore, our system is very efficient in that it can complete Event Query Generation (EQG) in 16 minutes and Event Search (ES) over 200,000 videos in less than 5 minutes on a single workstation. The main improvements are highlighted below: 1. Large-scale semantic concept detectors (for the SQ/000Ex systems): our large-scale semantic video concept detectors, whose vocabulary is 10 times larger than last year's, enabled us to outperform other systems significantly in the SQ and 000Ex settings. The detector training is based on self-paced learning theory [2] [3] [4]. 2. CMU improved dense trajectories [5] (for the 010Ex/100Ex systems): we enhanced improved trajectories [6] by encoding spatial and time information to model spatial information and temporal invariance. 3. ImageNet deep learning features (for the 010Ex/100Ex systems): we derived 15 different low-level deep learning features [7] from ImageNet [8], and these features have been shown to be among the best low-level features in MED.
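The pseudo-relevance-feedback reranking mentioned for the 000Ex/010Ex systems can be illustrated by a minimal sketch: treat the top of the initial ranked list as pseudo-positives and the bottom as pseudo-negatives, retrain a detector on low-level features, and rerank, for at most a couple of rounds. Logistic regression and the top_k/bottom_k values are illustrative stand-ins, not the CMU system's actual choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prf_rerank(features, init_scores, top_k=10, bottom_k=100, n_rounds=2):
    """Pseudo-relevance-feedback reranking sketch over a video collection."""
    scores = np.asarray(init_scores, dtype=float)
    for _ in range(n_rounds):
        order = np.argsort(-scores)
        pos, neg = order[:top_k], order[-bottom_k:]          # pseudo labels from the ranking
        X = np.vstack([features[pos], features[neg]])
        y = np.concatenate([np.ones(top_k), np.zeros(bottom_k)])
        clf = LogisticRegression(max_iter=1000).fit(X, y)    # stand-in event classifier
        scores = clf.predict_proba(features)[:, 1]           # rescore every video
    return np.argsort(-scores)                               # reranked video indices
```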
... Blitzer et al. [13] used structural correspondence learning to model the correlation between data features in different fields, and utilized key features for discrimination. Jiang et al. [14] further introduced a latent space into multi-perspective data and extended its features to address the differences caused by multi-perspective data. A schematic illustration is shown in Figure 1(a). ...
Preprint
Full-text available
In order to solve the problem of inconsistent data distributions in machine learning, domain adaptation methods based on feature representation extract features from the source domain and transfer them to the target domain for classification. Existing feature-representation-based methods mainly solve the problem of inconsistent feature distributions between the source domain data and the target domain data, but only few methods analyze the correlation of cross-domain features between the original space and the shared latent space, which reduces the performance of domain adaptation. To this end, we propose a domain adaptation method with a residual module, the main ideas of which are: (1) transfer the source domain data features to the target domain data through the shared latent space to achieve feature sharing; (2) build a cross-domain residual learning model using the latent feature space as the residual connection of the original feature space, which improves the propagation efficiency of features; (3) regularize the feature space towards a sparse feature representation, which can improve the robustness of the model; and (4) give an optimization algorithm; experiments on public visual datasets verify the effectiveness of the method.
... Recently, representing videos using high-level features, such as concept detectors (Snoek and Smeulders 2010), has appeared promising for the complex event detection task. However, the state-of-the-art concept-detector-based approaches (Jiang, Hauptmann, and Xiang 2012; Snoek and Smeulders 2010; Ma et al. 2013; Sun and Nevatia 2013) for complex event detection have not considered which concepts should be included in the training concept list. This always results in redundant concepts (Sun and Nevatia 2013) in the concept list used for vocabulary construction. ...
Article
Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select high-level, semantically meaningful concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose two novel strategies to automatically select semantically meaningful concepts for the event detection task, based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event-oriented dictionary representation based on the selected semantic concepts. Towards this goal, we leverage training samples of the selected concepts from the Semantic Indexing (SIN) dataset, with a pool of 346 concepts, in a novel supervised multi-task dictionary learning framework. Extensive experimental results on the TRECVID Multimedia Event Detection (MED) dataset demonstrate the efficacy of our proposed method.
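One simple way to picture "selecting concepts from the event-kit text descriptions" is a text-similarity ranking between the event description and each concept's description, as in the sketch below. TF-IDF cosine similarity is an illustrative stand-in, not the paper's actual selection strategies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_concepts(event_description, concept_descriptions, top_k=20):
    """Rank a concept pool by textual similarity to an event-kit description and
    return the indices of the top_k most related concepts."""
    vec = TfidfVectorizer(stop_words="english")
    M = vec.fit_transform([event_description] + list(concept_descriptions))
    sims = cosine_similarity(M[0], M[1:]).ravel()     # similarity of each concept to the event
    return sims.argsort()[::-1][:top_k]

# toy usage
picked = select_concepts("birthday party with cake, candles and singing",
                         ["cake", "dog", "candle", "parade", "singing"], top_k=3)
```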
... [Fragment of a flattened table of event definitions mentioning Hakeem et al., Jiang et al., and SanMiguel et al. (refs. [24], [28], [38]); the recoverable definitions are "a collection of actions performed between agents" (example: "A person stops moving left-hand.") and "a list of interactions between objects using any prior information concerning the context of a scene".] ...
Article
Full-text available
Since 2008, a variety of systems have been designed to detect events in security cameras, and more than a hundred journal articles and conference papers have been published in this field. However, no survey has focused on recognizing events in surveillance systems, which motivated us to provide a comprehensive review of the different event detection systems that have been developed. We start our discussion with the pioneering methods that used the TRECVid-SED dataset and then move to the methods developed on the VIRAT dataset in the TRECVid evaluation. To better understand the designed systems, we describe the components of each method and the modifications to existing methods separately. We have outlined the significant challenges related to untrimmed security video action detection. Suitable metrics are also presented for assessing the performance of the proposed models. For the TRECVid-SED dataset, our study indicated that the majority of researchers classified events into two groups on the basis of the number of participants and the duration of the event, and, depending on the group of events, used one or more models to identify all the events. For the VIRAT dataset, object detection models were used throughout to localize the first-stage activities; with the exception of one study, a 3D convolutional neural network (3D-CNN) was used to extract spatio-temporal features or classify different activities. From the review that has been carried out, it is possible to conclude that developing an automatic surveillance event detection system requires accurate and fast object detection in the first stage to localize the activities, and a classification model to draw conclusions from the detected inputs.
... "Birthday Party" and "Parade", solely based on the video content. Since MED is a very challenging task, there have been many studies proposed to tackle this problem in different settings, which includes training detectors using sufficient examples (Wang et al. 2013;Gkalelis and Mezaris 2014;Tong et al. 2014), using only a few examples (Safadi, Sahuguet, and Huet 2014;Jiang et al. 2014b), by exploiting semantic features (Tan, Jiang, and Neo 2014;Liu et al. 2013;Zhang et al. 2014;Inoue and Shinoda 2014;Jiang, Hauptmann, and Xiang 2012;Yu, Jiang, and Hauptmann 2014;Cao et al. 2013), and by automatic speech recognition (Miao, Metze, and Rawat 2013;Miao et al. 2014;Chiu and Rudnicky 2013). ...
Article
Curriculum learning (CL) and self-paced learning (SPL) represent recently proposed learning regimes inspired by the learning process of humans and animals, which gradually proceeds from easy to more complex samples in training. The two methods share a similar conceptual learning paradigm, but differ in their specific learning schemes. In CL, the curriculum is predetermined by prior knowledge and remains fixed thereafter. Therefore, this type of method heavily relies on the quality of the prior knowledge while ignoring feedback about the learner. In SPL, the curriculum is dynamically determined to adjust to the learning pace of the learner. However, SPL is unable to deal with prior knowledge, rendering it prone to overfitting. In this paper, we discover the missing link between CL and SPL, and propose a unified framework named self-paced curriculum learning (SPCL). SPCL is formulated as a concise optimization problem that takes into account both the prior knowledge known before training and the learning progress during training. In comparison to human education, SPCL is analogous to an "instructor-student-collaborative" learning mode, as opposed to the "instructor-driven" mode in CL or the "student-driven" mode in SPL. Empirically, we show the advantage of SPCL on two tasks.
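For readers unfamiliar with SPL, the standard hard-weighting regime alternates between selecting samples whose loss falls below a pace parameter and refitting the model while the pace grows; SPCL additionally constrains the sample weights with a predetermined curriculum region, which is omitted here. The sketch below shows only the generic SPL loop, with the user-supplied `fit`, `loss_fn`, and pace schedule as assumptions.

```python
import numpy as np

def spl_weights(losses, lam):
    """Hard self-paced weights: include a sample only if its loss is below the pace lam."""
    return (np.asarray(losses) < lam).astype(float)

def self_paced_train(fit, loss_fn, X, y, lam=0.5, mu=1.3, n_rounds=5):
    """Alternate between refitting the model on currently selected (easy) samples and
    reselecting samples, while the pace parameter grows by a factor mu each round."""
    v = np.ones(len(y))                          # start with all samples selected
    model = None
    for _ in range(n_rounds):
        model = fit(X[v > 0], y[v > 0])          # train on the selected subset
        v = spl_weights(loss_fn(model, X, y), lam)
        lam *= mu                                # let harder samples in gradually
    return model
```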
... For example, Natarajan et al. (Natarajan et al., 2012) combine a large set of features from different modalities using multiple kernel learning and late score-level fusion methods, where the features consist of several low-level features as well as high-level features obtained from object detector responses, automatic speech recognition, and video text recognition. Jiang et al. (Jiang et al., 2012) train a classifier from low-level features, encode high-level concept features into graphs, and diffuse the scores on the established graph to obtain the final prediction of the event. ...
Preprint
Full-text available
Multimedia event detection is the task of detecting a specific event of interest in a user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the videos as well as the inherently high-level semantic abstraction of events. In this paper, we decompose the video into several segments and intuitively model the task of complex event detection as a multiple instance learning problem by representing each video as a "bag" of segments in which each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable to indicate its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss that exploits low-level visual features together with instance-event-similarity-based high-level semantic features. Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability and gradually take instances with relatively low reliability into consideration. An alternating optimization algorithm is developed to solve the proposed challenging non-convex non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness and superiority of the proposed method over the baseline algorithms.
... Product Search and Item Retrieval: a lot of research has been done on product search by incorporating text into the query; for example, [11,15] use the user's feedback to refine the search query. For the problem of image-based product search, Vo et al. [33] proposed a method that regards an image and a text string as a query and allows attribute modification. ...
Preprint
In this paper, we identify and study an important problem of gradient item retrieval. We define the problem as retrieving a sequence of items with a gradual change in a certain attribute, given a reference item and a modification text. For example, after a customer sees a white dress, she/he may want to buy a similar one that is more floral. The extent of "more floral" is subjective, so presenting a single floral dress can hardly satisfy the customer's needs. A better way is to present a sequence of products with increasingly floral attributes based on the white dress, and allow the customer to select the most satisfactory one from the sequence. Existing item retrieval methods mainly focus on whether the target items appear at the top of the retrieved sequence, but ignore the demand for retrieving a sequence of products with a gradual change in a certain attribute. To deal with this problem, we propose a weakly-supervised method that can learn a disentangled item representation from user-item interaction data and ground the semantic meaning of attributes to dimensions of the item representation. Our method takes a reference item and a modification as a query. During inference, we start from the reference item and "walk" along the direction of the modification in the item representation space to retrieve a sequence of items in a gradient manner. We demonstrate that our proposed method can achieve disentanglement through weak supervision. Besides, we empirically show that an item sequence retrieved by our method changes gradually on the indicated attribute and, in the item retrieval task, our method outperforms existing approaches on three different datasets.
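The "walk" at inference time can be pictured as stepping from the reference embedding along the modification direction and retrieving neighbors at each step, as in the hedged sketch below; the step size, number of steps, and cosine-similarity retrieval are assumptions, not the paper's exact procedure.

```python
import numpy as np

def gradient_retrieve(ref_vec, direction, item_matrix, n_steps=5, step=0.5, per_step=3):
    """Walk from the reference item along the (unit) modification direction in the
    representation space and return the nearest item indices at each step."""
    direction = direction / (np.linalg.norm(direction) + 1e-12)
    items = item_matrix / (np.linalg.norm(item_matrix, axis=1, keepdims=True) + 1e-12)
    results = []
    for k in range(1, n_steps + 1):
        q = ref_vec + k * step * direction            # move further along the attribute
        q = q / (np.linalg.norm(q) + 1e-12)
        results.append(np.argsort(-(items @ q))[:per_step])
    return results
```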
... Hybrid fusion, or double fusion, attempts to take advantage of both early and late fusion mechanisms. It has been widely used in multimodal learning research, e.g., multimodal speech recognition [39,32] and multimedia event detection [17,40,15]. Beyond this big-picture view of fusion strategies, there are many specific strategies at the feature level. ...
Preprint
The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanisms are beneficial to the fusion process. In this paper, we propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization. In particular, we present a concise yet effective architecture that learns representations from multiple modalities in a joint manner. Initially, visual features are combined with auditory features and then turned into joint representations. Next, we make use of the joint representations to attend to visual features and auditory features, respectively. With the help of this joint co-attention, new visual and auditory features are produced, and thus both features can enjoy mutually improved benefits from each other. It is worth noting that the joint co-attention unit is recursive, meaning that it can be performed multiple times to obtain better joint representations progressively. Extensive experiments on the public AVE dataset show that the proposed method achieves significantly better results than the state-of-the-art methods.
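One round of the joint co-attention idea can be sketched as below (PyTorch): fuse the two modalities into a joint vector, then use that vector to re-weight each modality's features. The pooling, scoring functions, and dimensions are assumptions and the recursion is left to the caller; this is not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCoAttention(nn.Module):
    """Fuse audio and visual features into a joint vector, then attend back to each
    modality with it (a minimal sketch of one co-attention round)."""
    def __init__(self, dim=128):
        super().__init__()
        self.joint = nn.Linear(2 * dim, dim)
        self.v_att = nn.Linear(2 * dim, 1)
        self.a_att = nn.Linear(2 * dim, 1)

    def forward(self, vis, aud):            # vis: (B, Nv, D) regions, aud: (B, Na, D) segments
        j = torch.tanh(self.joint(torch.cat([vis.mean(1), aud.mean(1)], dim=-1)))  # (B, D)

        def attend(feats, scorer):
            q = j.unsqueeze(1).expand(-1, feats.size(1), -1)
            w = F.softmax(scorer(torch.cat([feats, q], dim=-1)), dim=1)             # (B, N, 1)
            return (w * feats).sum(dim=1)                                           # (B, D)

        return attend(vis, self.v_att), attend(aud, self.a_att)
```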
... The image retrieval process has developed from text to visual content to semantic-based retrieval wherein the idea is to extract the relevant attributes [26], relative attributes [27], and absolute attributes [28] of images and compare the features to retrieve the desired image. ...
Article
Full-text available
For patients with color vision defects, owing to the destruction of cone cells and the loss of their function, some color information is lost, and hence the information originally conveyed by the image is changed. Traditional correction methods aim to help patients distinguish between colors, but they do not take the saliency of the image into account. In this paper, we propose a saliency-consistency-based image recolorization for color blindness. We use image retrieval methods to select a large number of images and use co-saliency methods to detect salient areas of standard color images and color-blind simulated images. According to the detection results, an image with the same detection result is selected as the reference image. We convert the significantly changed image to grayscale and recolor the grayscale image using the reference image, so that the color scheme of the recolored image is similar to that of the reference image. The color matching scheme of the reference image keeps the saliency of the image essentially unchanged under both standard vision and color vision defects, thereby making a color-blind patient's perception of the image close to standard vision. We invited green-blind patients to subjectively evaluate CVD-simulated images produced by different recoloring methods. In addition, we use different evaluation criteria to evaluate the experimental results objectively. In both the subjective and objective evaluations, the method proposed in this paper achieves good results, which validates its effectiveness.
... Image retrieval: besides traditional text-to-image [32] or image-to-image retrieval [31] tasks, there are many image retrieval applications with other types of search query, such as: sketch [25], scene layout [16], relevance feedback [24,11], product attribute feedback [13,37,9,1], dialog interaction [8] and combined image-text queries [30]. In this work, the image search query will be a combination of a reference image and a transformation specification. ...
Preprint
With a good image understanding capability, can we manipulate an image's high-level semantic representation? Such a transformation operation can be used to generate or retrieve similar images but with a desired modification (for example, changing a beach background to a street background); a similar ability has been demonstrated in zero-shot learning, attribute composition and attribute manipulation image search. In this work we show how one can learn transformations with no training examples by learning them on another domain and then transferring them to the target domain. This is feasible if: first, transformation training data is more accessible in the other domain; and second, both domains share similar semantics such that one can learn transformations in a shared embedding space. We demonstrate this on an image retrieval task where the search query is an image plus an additional transformation specification (for example: search for images similar to this one but where the background is a street instead of a beach). In one experiment, we transfer transformations from synthesized 2D blob images to 3D rendered images, and in the other, we transfer from the text domain to the natural image domain.
... A temporal attention mechanism is then proposed for the representation of complete videos by selecting the most relevant temporal segments through an RNN. Jian et al. [75] rely on a CNN model alongside an RNN to capture spatial and temporal information for event detection in soccer videos. The target events in this work include goals, corners and goal attempts. ...
Article
Full-text available
Event recognition is one of the areas in multimedia that is attracting great attention from researchers. Because it is applicable to a wide range of applications, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. On the other hand, following its immense success in classification, object recognition and detection, deep learning has been shown to perform well in event recognition tasks as well. Thus, a large portion of the literature on event analysis nowadays relies on deep learning architectures. In this paper, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. In this paper, we extensively review different deep learning-based frameworks for event recognition in these four domains. Furthermore, we also review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we also provide a detailed discussion of the basic insights gathered from the literature review, and identify future trends and challenges.
... A lot of research has been done to improve product retrieval performance by incorporating the user's feedback into the search query in the form of relevance [31,14], relative [18] or absolute attributes [42,10,1]. Tackling the problem of image-based fashion search, Zhao et al. [42] proposed a memory-augmented deep learning system that can perform attribute manipulation. ...
Preprint
Full-text available
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image and a source image plus source text: an embedding and composing function such that the target image feature is close to the feature obtained by composing the source image and text. We propose a new way to combine image and text using such a function designed for the retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to classify input queries, in addition to image retrieval.
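The "compose image and text so the result is close to the target image feature" idea can be sketched generically as below (PyTorch); the concatenation-MLP composer and the in-batch softmax loss are assumptions, not the paper's exact model or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Composer(nn.Module):
    """Compose an image embedding with a text embedding into one query embedding
    that should land near the target image embedding (sketch only)."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        return F.normalize(self.fuse(torch.cat([img_emb, txt_emb], dim=-1)), dim=-1)

def batch_softmax_loss(composed, target_emb, temperature=0.1):
    """Treat each query's own target as the positive and other targets in the batch
    as negatives."""
    logits = composed @ F.normalize(target_emb, dim=-1).t() / temperature
    labels = torch.arange(composed.size(0))
    return F.cross_entropy(logits, labels)
```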
... Since the research focus in content-based image retrieval (CBIR) systems has shifted from leveraging low-level visual features to high-level semantics [17,36], high-level features are now widely used in different multimedia-related applications such as event detection [10]. We determine neighbors of photos using three novel high-level features instead of the low-level visual features exploited in state-of-the-art methods [14,15,34]. ...
Chapter
Social media platforms such as Flickr allow users to annotate photos with descriptive keywords, called tags, with the goal of making multimedia content easily understandable, searchable, and discoverable. However, due to the manual, ambiguous, and personalized nature of user tagging, many tags of a photo are in a random order and even irrelevant to the visual content. Moreover, manual annotation is very time-consuming and cumbersome for most users. Thus, it is difficult to search and retrieve relevant photos. To this end, we compute relevance scores to predict and rank tags of photos. Specifically, first we present a tag recommendation system, called PROMPT, that recommends personalized tags for a given photo leveraging personal and social contexts. In particular, we first determine a group of users who have similar tagging behavior to the user of the photo, which is very useful in recommending personalized tags. Next, we find candidate tags from visual content, textual metadata, and tags of neighboring photos, and recommend the five most suitable tags. We initialize the scores of the candidate tags using asymmetric tag co-occurrence probabilities and the normalized scores of tags after neighbor voting, and later perform a random walk to promote the tags that have many close neighbors and weaken isolated tags. Finally, we recommend the top five user tags for the given photo. Next, we present a tag ranking system, called CRAFT, based on voting from photo neighbors derived from multimodal information. Specifically, we determine photo neighbors leveraging geo, visual, and semantics concepts derived from spatial information, visual content, and textual metadata, respectively. We leverage high-level features instead of traditional low-level features to compute tag relevance. Experimental results on the YFCC100M dataset confirm that the PROMPT and CRAFT systems outperform their baselines.
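The random-walk refinement described above, which promotes tags with many close neighbors and weakens isolated tags, can be sketched as below; the data structures (a dict of neighbor-voting scores and a dict of pairwise co-occurrence weights) and the damping factor are assumptions for illustration.

```python
import numpy as np

def rank_tags(init_scores, cooccur, alpha=0.85, n_iter=50):
    """Refine candidate-tag scores by a random walk over a tag co-occurrence graph.
    init_scores: dict tag -> neighbor-voting score; cooccur: dict (tag_i, tag_j) -> weight."""
    tags = sorted(init_scores)
    idx = {t: i for i, t in enumerate(tags)}
    P = np.zeros((len(tags), len(tags)))
    for (a, b), w in cooccur.items():
        if a in idx and b in idx:
            P[idx[a], idx[b]] = w
    row = P.sum(axis=1, keepdims=True)
    P = np.divide(P, row, out=np.zeros_like(P), where=row > 0)   # row-normalize
    s0 = np.array([init_scores[t] for t in tags], dtype=float)
    s = s0.copy()
    for _ in range(n_iter):
        s = alpha * P.T @ s + (1 - alpha) * s0    # propagate support between related tags
    return sorted(zip(tags, s), key=lambda kv: -kv[1])
```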
... Since the research focus in content-based image retrieval (CBIR) systems has shifted from leveraging low-level visual features to high-level semantics [95,199], high-level features are now widely used in different multimedia-related applications such as event detection [64]. We determine neighbors of UGIs using three novel high-level features instead of the low-level visual features exploited in state-of-the-art methods [89,92,195]. ...
Thesis
The amount of user-generated multimedia content (UGC) has increased rapidly in recent years due to the ubiquitous availability of smartphones, digital cameras, and affordable network infrastructures. So that users and social media companies can benefit from an automatic semantics and sentics understanding of UGC, this thesis focuses on developing effective algorithms for several significant multimedia analytics problems. Sentics are common affective patterns associated with natural language concepts exploited for tasks such as emotion recognition from text/speech or sentiment analysis. Knowledge structures derived from UGC are beneficial for efficient multimedia search, retrieval, and recommendation. However, real-world UGC is complex, and extracting the semantics and sentics from multimedia content alone is very difficult because suitable concepts may be exhibited in different representations. Moreover, due to the increasing popularity of social media sites and advancements in technology, it is now possible to collect a significant amount of important contextual information (e.g., spatial, temporal, and preference information). This necessitates analyzing the information of UGC from multiple modalities to facilitate different social media applications. Specifically, applications related to multimedia summarization, tag ranking and recommendation, preference-aware multimedia recommendation, and multimedia-based e-learning are built by exploiting the multimedia content (e.g., visual content) and associated contextual information (e.g., geo-, temporal, and other sensory data). However, it is very challenging to address the above-mentioned problems efficiently due to the following reasons: (i) difficulty in capturing the semantics of UGC, (ii) the existence of noisy metadata, (iii) difficulty in handling big datasets, (iv) difficulty in learning user preferences, and (v) the insufficient accessibility and searchability of video content. Exploiting information from multiple sources helps in addressing the aforementioned challenges and facilitating different social media applications. Therefore, in this thesis, we leverage information from multiple modalities and fuse the derived knowledge structures to provide effective solutions for several significant multimedia analytics problems. Our research focuses on the semantics and sentics understanding of UGC leveraging both content and contextual information. First, for a better semantics understanding of an event from a large collection of user-generated images (UGIs), we present the EventBuilder system. It enables people to automatically generate a summary for the event in real time by visualizing different social media such as Wikipedia and Flickr. In particular, we exploit Wikipedia as the event background knowledge to obtain more contextual information about the event. This information is very useful in effective event detection. Next, we solve an optimization problem to produce text summaries for the event. Subsequently, we present the EventSensor system, which aims to address sentics understanding and produces a multimedia summary for a given mood. It extracts concepts and mood tags from the visual content and textual metadata of UGIs and exploits them in a sentics-based multimedia summary. EventSensor supports sentics-based event summarization by leveraging EventBuilder as its semantics engine component. Moreover, we focus on computing tag relevance for UGIs.
Specifically, we leverage personal and social contexts of UGIs and follow a neighbor voting scheme to predict and rank tags. Furthermore, we focus on semantics and sentics understanding from user-generated videos (UGVs). Since many outdoor UGVs lack a certain appeal because their soundtracks consist mostly of ambient background noise, we solve the problem of making UGVs more attractive by recommending a matching soundtrack for a UGV by exploiting content and contextual information. In particular, first, we predict scene moods from a real-world video dataset that was collected from users' daily outdoor activities. Second, we perform heuristic rankings to fuse the predicted confidence scores of multiple models, and third, we customize the video soundtrack recommendation functionality to make it compatible with mobile devices. Furthermore, we address the problem of knowledge structure extraction from educational UGVs to facilitate e–learning. Specifically, we solve the problem of topic-wise segmentation for lecture videos. To extract the structural knowledge of a multi-topic lecture video and thus make it easily accessible, it is very desirable to divide each video into shorter clips by performing an automatic topic-wise video segmentation. However, the accessibility and searchability of most lecture video content are still insufficient due to the unscripted and spontaneous speech of speakers. We present the ATLAS and TRACE systems to automatically perform the temporal segmentation of lecture videos. In our studies, we construct models from visual, transcript, and Wikipedia features to perform such topic-wise segmentations of lecture videos. Moreover, we investigate the late fusion of video segmentation results derived from state-of-the-art methods by exploiting the multimodal information of lecture videos.
... Since the research focus in content-based image retrieval (CBIR) systems has shifted from leveraging low-level visual features to high-level semantics [17,36], high-level features are now widely used in different multimedia-related applications such as event detection [10]. We determine neighbors of photos using three novel high-level features instead of the low-level visual features exploited in state-of-the-art methods [14,15,34]. ...
Conference Paper
Social media platforms allow users to annotate photos with tags that significantly facilitate effective semantics understanding, search, and retrieval of photos. However, due to the manual, ambiguous, and personalized nature of user tagging, many tags of a photo are in a random order and even irrelevant to the visual content. Aiming to automatically compute tag relevance for a given photo, we propose a tag ranking scheme based on voting from photo neighbors derived from multimodal information. Specifically, we determine photo neighbors leveraging geo, visual, and semantics concepts derived from spatial information, visual content, and textual metadata, respectively. We leverage high-level features instead of traditional low-level features to compute tag relevance. Experimental results on a representative set of 203,840 photos from the YFCC100M dataset confirm that the above-mentioned multimodal concepts complement each other in computing tag relevance. Moreover, we explore the fusion of multimodal information to refine tag ranking leveraging recall-based weighting. Experimental results on the representative set confirm that the proposed algorithm outperforms state-of-the-art methods.
Article
In recent years, image-level weakly supervised semantic segmentation (WSSS) has developed rapidly in natural scenes due to the easy availability of classification tags. However, owing to the complex backgrounds, multi-category scenes, and dense small targets in remote sensing (RS) images, relatively little research has been conducted in this field. To alleviate the impact of the above problems in RS scenes, a self-supervised Siamese network based on an explicit pixel-level constraints framework is proposed, which greatly improves the quality of class activation maps and the positioning accuracy in multi-category RS scenes. Specifically, three novel components in this paper promote performance to a new level: (a) a pixel-soft classification loss is proposed, which realizes explicit constraints on pixels during image-level training; (b) a pixel global awareness module, which captures high-level semantic context and low-level pixel spatial information, is constructed to improve the consistency and accuracy of RS object segmentation; (c) a dynamic multi-scale fusion module with a gating mechanism is devised, which enhances feature representation and improves the positioning accuracy of RS objects, particularly on small and dense objects. Experiments on two RS challenge datasets demonstrate that these proposed modules achieve new state-of-the-art results using only image-level labels, improving mIoU to 36.79% on iSAID and 45.43% on ISPRS in the WSSS task. To the best of our knowledge, this is the first work to perform image-level WSSS on multi-class RS scenes.
Article
Due to the prevalence of social media sites, users can conveniently share their ideas and activities anytime and anywhere. Therefore, these sites hold substantial real-world event-related data. Different from traditional social event detection methods, which mainly focus on a single medium, multi-modal social event detection aims at discovering events in vast heterogeneous data such as texts, images and video clips. These data describe real-world events from multiple dimensions simultaneously, so they can provide a comprehensive and complementary understanding of social events. In recent years, multi-modal social event detection has attracted intensive attention. This paper conducts a comprehensive survey of extant works. Two current lines of work in this field are first reviewed: event feature learning and event inference. In particular, event feature learning is a prerequisite because of its ability to translate social media data into a computer-friendly numerical form. Event inference aims at deciding whether a sample belongs to a social event. Then, several public datasets in the community are introduced and comparison results are provided. At the end of this paper, a general discussion of the insights is delivered to promote the development of multi-modal social event detection.
Article
Self-paced learning (SPL) is a recently proposed paradigm that imitates the learning process of humans and animals. SPL involves easier samples in training first and then gradually takes more complex ones into consideration. Current SPL regimes incorporate a self-paced regularizer into the learning objective with a gradually increasing pace parameter, which makes it difficult to obtain the solution path of the SPL regime and to determine where to optimally stop this increasing process. In this paper, a multi-objective self-paced learning method is proposed to optimize the loss function and the self-paced regularizer simultaneously. A decomposition-based multi-objective particle swarm optimization algorithm is used to optimize the two objectives simultaneously and obtain the solutions. In the proposed method, a polynomial soft weighting regularizer is proposed to penalize the loss. Theoretical studies show that previous regularizers are roughly particular cases of the proposed polynomial soft weighting regularizer family. An implicit decomposition method is then proposed to search for solutions with respect to the number of samples involved in training. The set of solutions obtained by the proposed method naturally constitutes the solution path of the SPL regime, from which a satisfactory solution can be selected using effective tools from evolutionary multi-objective optimization. Experiments on matrix factorization and classification problems demonstrate the effectiveness of the proposed technique.
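A minimal sketch of the self-paced weighting idea, assuming a hard thresholding scheme and one illustrative polynomial-style soft scheme (not the exact regularizer family analyzed in the paper): samples whose loss falls below the pace parameter are admitted to training, and the pace is gradually increased.

```python
# Illustrative self-paced weights as a function of per-sample loss and the pace
# parameter lam: the hard scheme admits a sample fully or not at all, the soft
# scheme (a simple polynomial-style decay, invented here) down-weights harder
# samples smoothly.
import numpy as np

def hard_spl_weights(losses, lam):
    """Classic SPL: w_i = 1 if loss_i < lam else 0."""
    return (losses < lam).astype(float)

def poly_soft_weights(losses, lam, t=2.0):
    """Soft weight decaying from 1 to 0 as the loss grows from 0 to lam."""
    w = np.clip(1.0 - losses / lam, 0.0, 1.0)
    return w ** (1.0 / (t - 1.0))

losses = np.array([0.05, 0.2, 0.4, 0.9, 1.5])
for lam in (0.3, 0.6, 1.2):   # growing pace: more (and harder) samples admitted
    print(lam, hard_spl_weights(losses, lam), poly_soft_weights(losses, lam).round(2))
```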
Chapter
Capturing videos anytime and anywhere, and then instantly sharing them online, has become a very popular activity. However, many outdoor user-generated videos (UGVs) lack a certain appeal because their soundtracks consist mostly of ambient background noise. Aimed at making UGVs more attractive, we introduce ADVISOR, a personalized video soundtrack recommendation system. We propose a fast and effective heuristic ranking approach based on heterogeneous late fusion by jointly considering three aspects: venue categories, visual scene, and user listening history. Specifically, we combine confidence scores, produced by SVMhmm models constructed from geographic, visual, and audio features, to obtain different types of video characteristics. Our contributions are threefold. First, we predict scene moods from a real-world video dataset that was collected from users’ daily outdoor activities. Second, we perform heuristic rankings to fuse the predicted confidence scores of multiple models, and third we customize the video soundtrack recommendation functionality to make it compatible with mobile devices. A series of extensive experiments confirm that our approach performs well and recommends appealing soundtracks for UGVs to enhance the viewing experience.
Chapter
This book studied several significant multimedia analytics problems and presented solutions that leverage multimodal information. The multimodal information of user-generated content (UGC) is very useful for effective search, retrieval, and recommendation services on social media. Specifically, we determine semantics and sentics information from UGC and leverage it to build improved systems for several significant multimedia analytics problems. We collected and created a significant amount of user-generated multimedia content in our study. To benefit from the multimodal information, we extract knowledge structures from different modalities and exploit them in our solutions for several significant multimedia-based applications. We presented our solutions for event understanding from UGIs, tag ranking and recommendation for UGIs, soundtrack recommendation for UGVs, lecture video segmentation, and news video uploading in areas with weak network infrastructure, all leveraging multimodal information. Here we summarize our contributions and future work for several significant multimedia analytics problems.
Chapter
In multimedia-based e-learning systems, the accessibility and searchability of most lecture video content is still insufficient due to the unscripted and spontaneous speech of the speakers. Thus, it is very desirable to enable people to navigate and access specific topics within lecture videos by performing an automatic topic-wise video segmentation. This problem becomes even more challenging when the quality of such lecture videos is not sufficiently high. To this end, we first present the ATLAS system that has two main novelties: (i) a SVMhmm model is proposed to learn temporal transition cues and (ii) a fusion scheme is suggested to combine transition cues extracted from heterogeneous information of lecture videos. Subsequently, considering that contextual information is very useful in determining knowledge structures, we present the TRACE system to automatically perform such a segmentation based on a linguistic approach using Wikipedia texts. TRACE has two main contributions: (i) the extraction of a novel linguistic-based Wikipedia feature to segment lecture videos efficiently, and (ii) the investigation of the late fusion of video segmentation results derived from state-of-the-art algorithms. Specifically for the late fusion, we combine confidence scores produced by the models constructed from visual, transcriptional, and Wikipedia features. According to our experiments on lecture videos from VideoLectures.NET and NPTEL, the proposed algorithms in the ATLAS and TRACE systems segment knowledge structures more accurately compared to existing state-of-the-art algorithms.
Chapter
The rapid growth in the amount of photos and videos online makes it necessary for social media companies to automatically extract knowledge structures (concepts) from photos and videos to provide diverse multimedia-related services such as event detection and summarization. However, real-world photos and videos aggregated on social media sharing platforms (e.g., Flickr and Instagram) are complex and noisy, and extracting semantics and sentics from the multimedia content alone is a very difficult task because suitable concepts may be exhibited in different representations. Since semantics and sentics knowledge structures are very useful in multimedia search, retrieval, and recommendation, it is desirable to analyze UGCs from multiple modalities for a better understanding. To this end, we first present the EventBuilder system, which deals with semantics understanding and automatically generates a multimedia summary for a given event in real time by leveraging different social media such as Wikipedia and Flickr. Subsequently, we present the EventSensor system, which addresses sentics understanding and produces a multimedia summary for a given mood. It extracts concepts and mood tags from the visual content and textual metadata of UGCs and exploits them to support several significant multimedia-related services such as a musical multimedia summary. Moreover, EventSensor supports sentics-based event summarization by leveraging EventBuilder as its semantics engine. Experimental results confirm that both EventBuilder and EventSensor outperform their baselines and efficiently summarize knowledge structures on the YFCC100M dataset.
Chapter
An interesting recent trend, enabled by the ubiquitous availability of mobile devices, is that regular citizens report events which news providers then disseminate, e.g., CNN iReport. Often such news is captured in places with very weak network infrastructure, and it is imperative that a citizen journalist can quickly and reliably upload videos in the face of slow, unstable, and intermittent Internet access. We envision that middleboxes are deployed to collect these videos over energy-efficient short-range wireless networks. Multiple videos may need to be prioritized, and then optimally transcoded and scheduled. In this study we introduce an adaptive middlebox design, called NEWSMAN, to support citizen journalists. NEWSMAN jointly considers two aspects under varying network conditions: (i) choosing the optimal transcoding parameters, and (ii) determining the uploading schedule for news videos. We design, implement, and evaluate an efficient scheduling algorithm to maximize a user-specified objective function. We conduct a series of experiments using trace-driven simulations, which confirm that our approach is practical and performs well. For instance, NEWSMAN outperforms existing algorithms (i) by 12 times in terms of system utility (i.e., the sum of utilities of all uploaded videos), and (ii) by four times in terms of the number of videos uploaded before their deadline.
Article
Semantic attributes have been increasingly used in the past few years for multimedia event detection (MED) with promising results. The motivation is that multimedia events generally consist of lower-level components such as objects, scenes, and actions. By characterizing multimedia event videos with semantic attributes, one can exploit more informative cues for improved detection results. Much existing work obtains semantic attributes from images, which may be suboptimal for video analysis since these image-inferred attributes do not carry the dynamic information that is essential for videos. To address this issue, we propose to learn semantic attributes from external videos using their semantic labels; we name them video attributes in this paper. In contrast with multimedia event videos, these external videos depict lower-level contents such as objects, scenes, and actions. To harness video attributes, we propose an algorithm built on a correlation vector that correlates them to a target event. Consequently, we can incorporate video attributes latently as extra information into the event detector learnt from multimedia event videos in a joint framework. To validate our method, we perform experiments on the real-world large-scale TRECVID MED 2013 and 2014 datasets and compare our method with several state-of-the-art algorithms. The experiments show that our method is advantageous for MED.
Article
In this paper, we propose a nonlinear structural hashing (NSH) approach to learn compact binary codes for scalable video search. Unlike most existing video hashing methods, which consider image frames within a video separately for binary code learning, we develop a multi-layer neural network to learn compact and discriminative binary codes by exploiting both the structural information between different frames within a video and the nonlinear relationship between video samples. To be specific, we learn these binary codes under two different constraints at the output of our network: 1) the distance between the learned binary codes for frames within the same scene is minimized, and 2) the distance between the learned binary matrices for a video pair with the same label is less than a threshold and that for a video pair with different labels is larger than a threshold. To better measure the structural information of the scenes from videos, we employ a subspace clustering method to cluster frames into different scenes. Moreover, we design multiple hierarchical nonlinear transformations to preserve the nonlinear relationship between videos. Experimental results on three video datasets show that our method outperforms state-of-the-art hashing approaches on the scalable video search task.
Conference Paper
The World Wide Web has been witnessing an explosion of video content. Video data are becoming one of the most valuable sources of insights and information. However, existing video search methods are still based on text matching (text-to-text search) and can fail for the huge volumes of videos that have little relevant metadata or no metadata at all. In this paper, we propose an accurate, efficient and scalable semantic search method for Internet videos that allows for intelligent and flexible search schemes over the video content (text-to-video search and text&video-to-video search). To achieve this ambitious goal, we propose several novel methods to improve accuracy and efficiency. Extensive experiments demonstrate that the proposed methods surpass state-of-the-art accuracy and efficiency on multiple datasets. Based on the proposed methods, we implement E-Lamp Lite, the first of its kind large-scale semantic search engine for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) evaluations in 2013, 2014 and 2015, one of the most representative tasks for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search system capable of indexing and searching a collection of 100 million videos.
Conference Paper
Large-scale content-based semantic search in video is an interesting and fundamental problem in multimedia analysis and retrieval. Existing methods index a video by raw concept detection scores that are dense and inconsistent, and thus cannot scale to the "big data" readily available on the Internet. This paper proposes a scalable solution. The key is a novel step called concept adjustment, which represents a video by a few salient and consistent concepts that can be efficiently indexed by a modified inverted index. The proposed adjustment model relies on a concise optimization framework with clear interpretations, and the proposed index leverages the text-based inverted index for video retrieval. Experimental results validate the efficacy and efficiency of the proposed method and show that it can scale up semantic search while maintaining state-of-the-art search performance. Specifically, the proposed method (with reranking) achieves the best result on the challenging TRECVID Multimedia Event Detection (MED) zero-example task, and it takes only 0.2 seconds on a single CPU core to search a collection of 100 million Internet videos.
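A toy sketch of the indexing idea, with concept adjustment reduced to a simple top-k/threshold sparsification (a stand-in for the paper's optimization model) and retrieval done through a text-style inverted index; all concept names and scores are hypothetical.

```python
# Minimal sketch: index each video by a few salient concepts and retrieve
# through an inverted index, in the spirit of (but much simpler than) the
# concept-adjustment step described above.
from collections import defaultdict

# Dense concept-detector scores per video (hypothetical)
videos = {
    "v1": {"dog": 0.91, "park": 0.55, "car": 0.04, "crowd": 0.02},
    "v2": {"car": 0.88, "road": 0.73, "dog": 0.03},
    "v3": {"crowd": 0.81, "parade": 0.67, "car": 0.35},
}

def adjust(scores, top_k=2, min_score=0.3):
    """Keep only a few salient, confident concepts (simple sparsification)."""
    kept = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return {c: s for c, s in kept if s >= min_score}

# Build a text-style inverted index: concept -> [(video, score), ...]
index = defaultdict(list)
for vid, scores in videos.items():
    for concept, score in adjust(scores).items():
        index[concept].append((vid, score))

def search(query_concepts):
    hits = defaultdict(float)
    for c in query_concepts:
        for vid, score in index.get(c, []):
            hits[vid] += score
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)

print(search(["dog", "park"]))   # -> v1 ranked first
```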
Conference Paper
Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web videos, and have attracted a lot of attention in recent years. Most existing systems perform MER as a post-processing step on top of the MED results. In order to leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the indicative concepts of the events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts while detection directs recounting to the most discriminative evidences. To better utilize the powerful and interpretable semantic video representation, we segment each video into several shots and exploit the rich temporal structures at shot level. The consequent computational challenge is carefully addressed through a significant improvement of the current ADMM algorithm, which, after eliminating all inner loops and equipping novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
Conference Paper
Retrieval of a complex multimedia event has long been regarded as a challenging task. Multimedia event recounting, in contrast to event detection, focuses on providing comprehensible evidence that justifies a detection result. Recounting enables "video skimming", which not only enhances video exploration but also makes human-in-the-loop refinement of the detection result possible. Most existing systems treat event recounting as a disjoint post-processing step over the result of event detection. Unlike these systems, this doctoral research aims to provide an in-depth understanding of how recounting, i.e., evidence localization, helps event detection in the first place. It can potentially benefit the overall design of an efficient event detection system with or without a human in the loop. More importantly, we propose a framework for detecting and recounting everyday events without any need for training examples. The system only takes a text description of an event as input, then performs evidence localization, event detection and recounting in a large, unlabelled video corpus. The goal of the system is to take advantage of event recounting to ultimately improve zero-example event detection. We present preliminary results and work in progress.
Article
Content-based video understanding is extremely difficult due to the semantic gap between low-level vision signals and the various semantic concepts (object, action, and scene) in videos. Though feature extraction from videos has achieved significant progress, most of the previous methods rely only on low-level features, such as the appearance and motion features. Recently, visual-feature extraction has been improved significantly with machine-learning algorithms, especially deep learning. However, there is still not enough work focusing on extracting semantic features from videos directly. The goal of this article is to adopt unlabeled videos with the help of text descriptions to learn an embedding function, which can be used to extract more effective semantic features from videos when only a few labeled samples are available for video recognition. To achieve this goal, we propose a novel embedding convolutional neural network (ECNN). We evaluate our algorithm by comparing its performance on three challenging benchmarks with several popular state-of-the-art methods. Extensive experimental results show that the proposed ECNN consistently and significantly outperforms the existing methods.
Conference Paper
Complex video event detection without visual examples is a very challenging issue in multimedia retrieval. We present a state-of-the-art framework for event search that requires neither exemplar videos nor textual metadata in the search corpus. To perform event search given only query words, the core of our framework is a large, pre-built bank of concept detectors that can understand the content of a video in terms of object, scene, action and activity concepts. Leveraging such knowledge can effectively narrow the semantic gap between the textual query and the visual content of videos. Beyond the large concept bank, this paper focuses on two challenges that largely affect retrieval performance as the size of the concept bank increases: (1) how to choose the right concepts in the concept bank to accurately represent the query; and (2) if noisy concepts are inevitably chosen, how to minimize their influence. We share our novel insights on these problems, which pave the way for a practical system that achieves the best performance in NIST TRECVID 2015.
Article
Multimedia event detection has been one of the major endeavors in video event analysis, and a variety of approaches have been proposed recently to tackle this problem. Among others, semantic representation has been credited for its promising performance and desirable ability to support human-understandable reasoning. To generate a semantic representation, we usually utilize several external image/video archives and apply the concept detectors trained on them to the event videos. Due to the intrinsic differences among these archives, the resulting representations presumably have different predictive capabilities for a certain event. Nevertheless, not much work is available for assessing the efficacy of semantic representation at the source level. On the other hand, it is plausible that some concepts are noisy for detecting a specific event. Motivated by these two shortcomings, we propose a bi-level semantic representation analyzing method. At the source level, our method learns weights for the semantic representations obtained from different multimedia archives; meanwhile, it restrains the negative influence of noisy or irrelevant concepts at the concept level. In addition, we particularly focus on efficient multimedia event detection with few positive examples, which is highly valuable in real-world scenarios. We perform extensive experiments on the challenging TRECVID MED 2013 and 2014 datasets with encouraging results that validate the efficacy of our proposed approach.
Article
Recent advances in sparse representation show that overcomplete dictionaries learned from natural images can capture high-level features for image analysis. Since atoms in the dictionaries are typically edge patterns and image blur is characterized by the spread of edges, an overcomplete dictionary can be used to measure the extent of blur. Motivated by this, this paper presents a no-reference sparse representation-based image sharpness index. An overcomplete dictionary is first learned using natural images. The blurred image is then represented using the dictionary in a block manner, and block energy is computed using the sparse coefficients. The sharpness score is defined as the variance-normalized energy over a set of selected high-variance blocks, which is achieved by normalizing the total block energy using the sum of block variances. The proposed method is not sensitive to training images, so a universal dictionary can be used to evaluate the sharpness of images. Experiments on six public image quality databases demonstrate the advantages of the proposed method.
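The scoring step can be sketched as follows, assuming the sparse codes of image blocks have already been computed with a learned over-complete dictionary (the dictionary learning itself is omitted, and the coefficients below are synthetic stand-ins): sharpness is the energy of the sparse coefficients over high-variance blocks, normalized by the sum of those blocks' variances.

```python
# Sketch of the variance-normalized block-energy score only; real use would
# first learn a dictionary from natural images and sparse-code each block.
import numpy as np

def sharpness_score(block_coeffs, block_vars, top_frac=0.6):
    """block_coeffs: (n_blocks, n_atoms) sparse codes; block_vars: (n_blocks,)."""
    order = np.argsort(block_vars)[::-1]
    keep = order[: max(1, int(top_frac * len(order)))]   # high-variance blocks
    energy = np.sum(block_coeffs[keep] ** 2)
    return energy / (np.sum(block_vars[keep]) + 1e-12)

rng = np.random.default_rng(1)
coeffs_sharp = rng.laplace(scale=1.0, size=(50, 128))   # strong edge responses
coeffs_blur = rng.laplace(scale=0.3, size=(50, 128))    # spread edges -> lower energy
variances = rng.uniform(0.5, 2.0, size=50)

print("sharp:", sharpness_score(coeffs_sharp, variances))
print("blur :", sharpness_score(coeffs_blur, variances))
```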
Conference Paper
In this paper, we propose the velocity pyramid for multimedia event detection. Recently, spatial pyramid matching was proposed to introduce coarse geometric information into the Bag of Features framework, and it is effective for static image recognition and detection. In video, not only spatial information but also temporal information, which represents its dynamic nature, is important. To fully utilize it, we propose the velocity pyramid, in which video frames are divided into motional sub-regions. Our method is effective for detecting events characterized by their temporal patterns. Experiments on the MED (Multimedia Event Detection) dataset show a 10% performance improvement with the velocity pyramid over the method without it; when further combined with the spatial pyramid, the velocity pyramid provides an extra 3% gain in the detection result.
Conference Paper
Video analysis has been attracting increasing research interest due to the proliferation of Internet videos. In this paper, we investigate how to improve performance on Internet-quality video analysis, particularly in the less-studied scenario where only a few labeled training videos are provided. To begin with, we consider how to more effectively harness the evidence from low-level features. Researchers have developed several promising features to represent videos and capture their semantic information; however, since videos usually contain rich semantic content, the analysis performance obtained using a single feature is potentially limited. Simply combining multiple features through early fusion or late fusion to incorporate more informative cues is doable but not optimal, due to the heterogeneity and different predictive capabilities of these features. For better exploitation of multiple features, we propose to mine the importance of different features and cast it into the learning of the classification model. Our method builds multiple graphs from different features and uses the Riemannian metric to evaluate feature importance. To achieve respectable accuracy with limited labeled training videos, we formulate our method in a semi-supervised way. The main contribution of this paper is a novel scheme for evaluating feature importance that is further cast into a unified framework for harnessing multiple weighted features with limited labeled training videos. We perform extensive experiments on video action recognition and multimedia event recognition, and comparisons with other state-of-the-art multi-feature learning algorithms validate the efficacy of our framework.
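A simplified sketch of the multi-graph idea, with the Riemannian feature weighting replaced by fixed illustrative weights: one kNN graph is built per feature type, the graphs are combined with per-feature weights, and the few available labels are spread over the combined graph.

```python
# Simplified multi-graph semi-supervised sketch: build one kNN graph per
# feature type, combine them with (here fixed) importance weights, and run a
# standard label-spreading iteration with the few labeled nodes clamped.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
n = 60
feat_a = rng.normal(size=(n, 32))            # e.g., appearance feature
feat_b = rng.normal(size=(n, 16))            # e.g., motion feature

y = np.full(n, -1)                           # -1 = unlabeled
y[:6] = [0, 0, 0, 1, 1, 1]                   # only a few labeled videos

weights = {"a": 0.7, "b": 0.3}               # hypothetical feature-importance weights
W = (weights["a"] * kneighbors_graph(feat_a, 5).toarray()
     + weights["b"] * kneighbors_graph(feat_b, 5).toarray())
W = np.maximum(W, W.T)                       # symmetrize
D = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-12))
S = D @ W @ D                                # normalized combined graph

Y = np.zeros((n, 2))
Y[np.arange(n)[y >= 0], y[y >= 0]] = 1.0     # one-hot for labeled nodes
F = Y.copy()
for _ in range(50):                          # label-spreading iteration
    F = 0.9 * S @ F + 0.1 * Y

print(F.argmax(axis=1)[:10])                 # predictions for the first few nodes
```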
Conference Paper
Semantic search in video is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text matching, in which the query words are matched against the textual metadata generated by users. This paper presents a state-of-the-art system for event search without any textual metadata or example videos. The system relies on substantial video content understanding and allows for semantic search over a large collection of videos. Its novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieves the best performance. We share our observations and lessons in building such a state-of-the-art system, which may be instrumental in guiding the design of future systems for semantic search in video.
Conference Paper
Full-text available
This paper studies the effect of Latent Semantic Analysis (LSA) on two different tasks: multimedia document retrieval (MDR) and automatic image annotation (AIA). The contributions of this paper are twofold. First, to the best of our knowledge, this work is the first study of the influence of LSA on the retrieval of a significant number of multimedia documents (i.e. a collection of 20000 tourist images). Second, it shows how different image representations (region-based and keypoint-based) can be combined by LSA to improve automatic image annotation. The document collections used for these experiments are the Corel photo collection and the ImageCLEF 2006 collection.
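A small illustration of LSA-based retrieval, with short text annotations standing in for multimedia documents (the corpus, query, and dimensionality are invented): TF-IDF vectors are projected into a low-rank latent space with a truncated SVD and ranked against a query by cosine similarity.

```python
# Tiny LSA retrieval example: latent space from a truncated SVD of the
# TF-IDF term-document matrix, query ranking by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "beach palm trees sea tourists",
    "mountain snow ski tourists",
    "old town church market tourists",
    "sea harbour boats fishing",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)                      # documents in the latent space

query = tfidf.transform(["boats on the sea"])
q = lsa.transform(query)
# similarities to each document (the harbour/boats document is expected highest)
print(cosine_similarity(q, Z).ravel().round(3))
```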
Conference Paper
Full-text available
Collective classification can significantly improve accuracy by exploiting relationships among instances. Although several collective inference procedures have been reported, they have not been thoroughly evaluated for their commonalities and differences. We introduce novel generalizations of three existing algorithms that allow such algorithmic and empirical comparisons. Our generalizations permit us to examine how cautiously or aggressively each algorithm exploits intermediate relational data, which can be noisy. We conjecture that cautious approaches that identify and preferentially exploit the more reliable intermediate data should outperform aggressive approaches. We explain why caution is useful and introduce three parameters to control the degree of caution. An empirical evaluation of collective classification algorithms, using two base classifiers on three data sets, supports our conjecture.
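The cautious strategy can be sketched on a toy homophilous graph as follows (a hedged illustration, not the generalized algorithms evaluated in the paper): a local classifier is retrained on node features plus label counts from already-committed neighbors, and each round only the most confident predictions are committed.

```python
# Toy cautious iterative collective classification: relational features are the
# per-class counts of committed neighbors, and only the most confident half of
# the remaining nodes is committed per round.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 40
labels = np.array([0] * 20 + [1] * 20)                 # two communities
feats = rng.normal(loc=labels[:, None], scale=1.5, size=(n, 4))
adj = rng.uniform(size=(n, n)) < np.where(labels[:, None] == labels[None, :], 0.2, 0.02)
adj = np.triu(adj, 1)
adj = (adj + adj.T).astype(float)                      # undirected adjacency

# a few labeled nodes from each community
known = np.concatenate([rng.choice(20, 5, replace=False),
                        20 + rng.choice(20, 5, replace=False)])

def relational_feats(committed):
    """Per node: number of committed neighbors of each class."""
    out = np.zeros((n, 2))
    for c in (0, 1):
        mask = np.array([committed.get(j) == c for j in range(n)], dtype=float)
        out[:, c] = adj @ mask
    return out

committed = {int(i): int(labels[i]) for i in known}
clf = LogisticRegression(max_iter=1000)
for _ in range(10):
    X = np.hstack([feats, relational_feats(committed)])
    train = sorted(committed)
    clf.fit(X[train], [committed[i] for i in train])
    todo = [i for i in range(n) if i not in committed]
    if not todo:
        break
    probs = clf.predict_proba(X[todo])
    confident = np.argsort(np.abs(probs[:, 1] - 0.5))[::-1][: max(1, len(todo) // 2)]
    for k in confident:                                 # cautious commitment
        committed[todo[k]] = int(probs[k].argmax())

print("accuracy:", np.mean([committed[i] == labels[i] for i in range(n)]))
```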
Conference Paper
Full-text available
The problem of joint modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
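A minimal cross-modal sketch using canonical correlation analysis on synthetic paired features (the latent-topic data generator, dimensions, and query are invented): both modalities are projected into the shared correlated space, and retrieval is done by cosine similarity there.

```python
# Cross-modal retrieval sketch with CCA: learn paired projections for text and
# image features, then rank images against a text query in the shared space.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n, d_text, d_img, shared = 200, 50, 80, 5
Z = rng.normal(size=(n, shared))                 # hidden topics shared by both views
text = Z @ rng.normal(size=(shared, d_text)) + 0.1 * rng.normal(size=(n, d_text))
imgs = Z @ rng.normal(size=(shared, d_img)) + 0.1 * rng.normal(size=(n, d_img))

cca = CCA(n_components=shared).fit(text, imgs)
t_proj, i_proj = cca.transform(text, imgs)       # both views in the shared space

query = t_proj[3:4]                              # use document 3's text as the query
ranked = np.argsort(-cosine_similarity(query, i_proj).ravel())
print("top retrieved images:", ranked[:5])       # document 3's own image should rank high
```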
Conference Paper
Full-text available
Semantic analysis of multimodal video aims to index segments of interest at a conceptual level. In reaching this goal, it requires an analysis of several information streams. At some point in the analysis these streams need to be fused. In this paper, we consider two classes of fusion schemes, namely early fusion and late fusion. The former fuses modalities in feature space, the latter fuses modalities in semantic space. We show by experiment on 184 hours of broadcast video data and for 20 semantic concepts, that late fusion tends to give slightly better performance for most concepts. However, for those concepts where early fusion performs better the difference is more significant.
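The two schemes can be contrasted on synthetic two-modality data as below (a hedged sketch; the data, classifier choice, and equal late-fusion weights are illustrative): early fusion concatenates features before a single classifier, while late fusion trains one classifier per modality and averages their scores.

```python
# Early vs. late fusion on synthetic two-modality data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
visual = rng.normal(loc=y[:, None], scale=2.0, size=(n, 20))   # weakly informative
audio = rng.normal(loc=y[:, None], scale=1.5, size=(n, 10))

Xv_tr, Xv_te, Xa_tr, Xa_te, y_tr, y_te = train_test_split(
    visual, audio, y, test_size=0.5, random_state=0)

# Early fusion: concatenate modalities in feature space, train one classifier
early = LogisticRegression(max_iter=1000).fit(np.hstack([Xv_tr, Xa_tr]), y_tr)
early_scores = early.predict_proba(np.hstack([Xv_te, Xa_te]))[:, 1]

# Late fusion: fuse per-modality scores in "semantic" space
clf_v = LogisticRegression(max_iter=1000).fit(Xv_tr, y_tr)
clf_a = LogisticRegression(max_iter=1000).fit(Xa_tr, y_tr)
late_scores = 0.5 * (clf_v.predict_proba(Xv_te)[:, 1] + clf_a.predict_proba(Xa_te)[:, 1])

print("early fusion AUC:", round(roc_auc_score(y_te, early_scores), 3))
print("late fusion  AUC:", round(roc_auc_score(y_te, late_scores), 3))
```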
Conference Paper
Full-text available
In the retrieval, indexing and classification of multimedia data, an efficient information fusion of the different modalities is essential for the system's overall performance. Since information fusion, its influence factors and performance improvement boundaries have been lively discussed in the last years in different research communities, we will review their latest findings. They most importantly point out that exploiting the dependencies between features and modalities will yield maximal performance. In data analysis and fusion tests with annotated image collections this is undermined.
Conference Paper
Full-text available
Late fusion of independent retrieval methods is the simpler and more widely used approach for combining visual and textual information in the search process. Usually each retrieval method is based on a single modality, or even, when several methods are considered per modality, all of them use the same information for indexing/querying. The latter reduces the diversity and complementariness of the documents considered for the fusion, and as a consequence the performance of the fusion approach is poor. In this paper we study the combination of multiple heterogeneous methods for image retrieval in annotated collections. Heterogeneity is considered in terms of (i) the modality on which the methods are based, (ii) the information they use for indexing/querying and (iii) the individual performance of the methods. Different settings for the fusion are considered, including weighted, global, per-modality and hierarchical. We report experimental results, on an image retrieval benchmark, showing that the proposed combination significantly outperforms any of the individual methods we consider. Retrieval performance is comparable to the best performance obtained in the context of ImageCLEF 2007. An interesting result is that even methods that perform poorly (individually) proved very useful to the fusion strategy. Furthermore, in contrast to work reported in the literature, better results were obtained by assigning …
Conference Paper
Full-text available
56 pages - TRECVID workshop notebook papers/slides available at http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html
Article
Full-text available
Numerous real-world applications produce networked data such as web data (hypertext documents connected via hyperlinks) and communication networks (people connected via communication links). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such data. In this report, we attempt to provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.
Article
Full-text available
This paper presents NetKit, a modular toolkit for classification in networked data, and a case study of its application to a collection of networked data sets used in prior machine learning research. Networked data are relational data where entities are interconnected, and this paper considers the common case where entities whose labels are to be estimated are linked to entities for which the label is known. NetKit is based on a three-component framework, comprising a local classifier, a relational classifier, and a collective inference procedure. Various existing relational learning algorithms can be instantiated with appropriate choices for these three components, and new relational learning algorithms can be composed by new combinations of components. The case study demonstrates how the toolkit facilitates comparison of different learning methods (which so far has been lacking in machine learning research). It also shows how the modular framework allows analysis of subcomponents, to assess which, whether, and when particular components contribute to superior performance. The case study focuses on the simple but important special case of univariate network classification, for which the only information available is the structure of class linkage in the network (i.e., only links and some class labels are available). To our knowledge, no work previously has evaluated systematically the power of class linkage alone for classification in machine learning benchmark data sets. The results demonstrate clearly that simple network-classification models perform remarkably well: well enough that they should be used regularly as baseline classifiers for studies of relational learning for networked data. The results also show that there are a small number of component combinations that excel, and that different components are preferable in different situations, for example when few versus many labels are known.
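The univariate special case can be illustrated with a toy relational-neighbor style propagation (a hedged sketch in the spirit of the weighted-vote relational neighbor classifier with relaxation-labeling inference, not NetKit's actual implementation): each unlabeled node's class distribution is repeatedly replaced by the average of its neighbors' current distributions, with labeled nodes clamped.

```python
# Toy univariate network classification: only links and a few class labels are
# used; unlabeled nodes repeatedly take the average distribution of their
# neighbors while labeled nodes stay clamped.
import numpy as np

rng = np.random.default_rng(1)
n = 30
labels = np.array([0] * 15 + [1] * 15)
# homophilous random graph: dense within a class, sparse across classes
A = rng.uniform(size=(n, n)) < np.where(labels[:, None] == labels[None, :], 0.3, 0.05)
A = np.triu(A, 1)
A = (A + A.T).astype(float)

known = np.array([0, 1, 2, 15, 16, 17])        # a few labeled nodes per class
P = np.full((n, 2), 0.5)                       # current class distributions
P[known] = np.eye(2)[labels[known]]

for _ in range(20):
    P = A @ P / np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    P[known] = np.eye(2)[labels[known]]        # clamp the known labels

print("accuracy using links alone:", (P.argmax(axis=1) == labels).mean())
```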
Conference Paper
Full-text available
Local image features or interest points provide compact and abstract representations of patterns in an image. We propose to extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features often reflect interesting events that can be used for a compact representation of video data as well as for its interpretation. To detect spatio-temporal events, we build on the idea of the Harris and Forstner interest point operators and detect local structures in space-time where the image values have significant local variations in both space and time. We then estimate the spatio-temporal extents of the detected events and compute their scale-invariant spatio-temporal descriptors. Using such descriptors, we classify events and construct video representation in terms of labeled space-time points. For the problem of human motion analysis, we illustrate how the proposed method allows for detection of walking people in scenes with occlusions and dynamic backgrounds.
Article
Full-text available
Local image features or interest points provide compact and abstract representations of patterns in an image. In this paper, we extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features capture interesting events in video and can be used for a compact representation and for interpretation of video data.
Article
Ensemble classification methods that independently construct component models (e.g., bagging) improve accuracy over single models by reducing the error due to variance. Some work has been done to extend ensemble techniques for classification in relational domains by taking relational data characteristics or multiple link types into account during model construction. However, since these approaches follow the conventional approach to ensemble learning, they improve performance by reducing the error due to variance in learning. We note however, that variance in inference can be an additional source of error in relational methods that use collective classification, since inferred values are propagated during inference. We propose a novel ensemble mechanism for collective classification that reduces both learning and inference variance, by incorporating prediction averaging into the collective inference process itself. We show that our proposed method significantly outperforms a straightforward relational ensemble baseline on both synthetic and real-world datasets.
Article
Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions. Marksmen side by side firing simultaneous shots at targets, so that the deviations are in part due to independent individual errors and in part to common causes such as wind, provide a familiar introduction to the theory of correlation; but only the correlation of the horizontal components is ordinarily discussed, whereas the complex consisting of horizontal and vertical deviations may be even more interesting. The wind at two places may be compared, using both components of the velocity in each place. A fluctuating vector is thus matched at each moment with another fluctuating vector. The study of individual differences in mental and physical traits calls for a detailed study of the relations between sets of correlated variates. For example the scores on a number of mental tests may be compared with physical measurements on the same persons. The questions then arise of determining the number and nature of the independent relations of mind and body shown by these data to exist, and of extracting from the multiplicity of correlations in the system suitable characterizations of these independent relations. As another example, the inheritance of intelligence in rats might be studied by applying not one but s different mental tests to N mothers and to a daughter of each
Conference Paper
Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
Conference Paper
The objective of this paper is classifying images by the object categories they contain, for example motorbikes or dolphins. There are three areas of novelty. First, we introduce a descriptor that represents local image shape and its spatial layout, together with a spatial pyramid kernel. These are designed so that the shape correspondence between two images can be measured by the distance between their descriptors using the kernel. Second, we generalize the spatial pyramid kernel, and learn its level weighting parameters (on a validation set). This significantly improves classification performance. Third, we show that shape and appearance kernels may be combined (again by learning parameters on a validation set). Results are reported for classification on Caltech-101 and retrieval on the TRECVID 2006 data sets. For Caltech-101 it is shown that the class specific optimization that we introduce exceeds the state of the art performance by more than 10%.
Conference Paper
This paper introduces a method for scene categorization by modeling ambiguity in the popular codebook approach. The codebook approach describes an image as a bag of discrete visual codewords, where the frequency distributions of these words are used for image categorization. There are two drawbacks to the traditional codebook model: codeword uncertainty and codeword plausibility. Both of these drawbacks stem from the hard assignment of visual features to a single codeword. We show that allowing a degree of ambiguity in assigning codewords improves categorization performance for three state-of-the-art datasets.
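The difference between hard and soft assignment can be sketched with a random codebook and random descriptors (illustrative data only; the Gaussian bandwidth below is a simple heuristic rather than the paper's kernel-codebook estimator):

```python
# Hard vs. soft codeword assignment: hard assignment votes only for the nearest
# codeword, soft assignment spreads a Gaussian-weighted vote over all codewords,
# which tolerates ambiguous descriptors.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 32))          # 16 visual words, 32-dim descriptors
descriptors = rng.normal(size=(100, 32))      # local features from one image

d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # squared distances

# Hard assignment: histogram of nearest-codeword counts
hard_hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook)).astype(float)
hard_hist /= hard_hist.sum()

# Soft (kernel) assignment: Gaussian weights over all codewords
sigma = np.sqrt(d2.mean())                    # heuristic bandwidth
w = np.exp(-d2 / (2 * sigma ** 2))
soft_hist = (w / w.sum(axis=1, keepdims=True)).sum(axis=0)
soft_hist /= soft_hist.sum()

print("hard:", hard_hist.round(3))
print("soft:", soft_hist.round(3))
```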
Conference Paper
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
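A compact sketch of the pyramid construction, assuming local features have already been quantized to codewords with known normalized positions (the level weights below are a simplified stand-in for the pyramid-match kernel weights in the paper):

```python
# Spatial-pyramid histogram sketch: codeword histograms over 1x1, 2x2 and 4x4
# grids are concatenated, with coarser levels down-weighted (simplified weights).
import numpy as np

rng = np.random.default_rng(0)
n_words, n_feats = 20, 500
words = rng.integers(0, n_words, size=n_feats)      # codeword index per local feature
xy = rng.uniform(0, 1, size=(n_feats, 2))           # normalized feature positions

def pyramid(words, xy, levels=(0, 1, 2)):
    chunks = []
    for l in levels:
        cells = 2 ** l
        weight = 1.0 / 2 ** (max(levels) - l)        # coarser levels get smaller weight
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                inside = (cx == i) & (cy == j)
                hist = np.bincount(words[inside], minlength=n_words).astype(float)
                chunks.append(weight * hist)
    vec = np.concatenate(chunks)
    return vec / (vec.sum() + 1e-12)

print("pyramid vector length:", pyramid(words, xy).shape[0])   # 20 * (1 + 4 + 16)
```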