Conference Paper

Semantic Concept Annotation of Consumer Videos at Frame-Level Using Audio

Abstract

With the increasing use of audio sensors in user generated content (UGC) collection, semantic concept annotation using audio streams has become an important research problem. Huawei initiated a grand challenge at the International Conference on Multimedia & Expo (ICME) 2014: the Huawei Accurate and Fast Mobile Video Annotation Challenge. In this paper, we present our semantic concept annotation system for the Huawei challenge, which uses the audio stream only. The system extracts the audio stream from the video data and computes low-level acoustic features from it. A bag-of-features representation is generated from the low-level features and used as input to train support vector machine (SVM) concept classifiers. The experimental results show that our audio-only concept annotation system can detect semantic concepts significantly better than random guessing. It can also provide important complementary information to the visual-based concept annotation system for a performance boost.
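To make the pipeline concrete, the following is a minimal, hypothetical sketch of such an audio-only pipeline, assuming MFCCs as the low-level acoustic features and a k-means codebook for the bag-of-audio-words step; it uses librosa and scikit-learn, and `train_paths`/`y_train` are placeholder names for training clips (audio already extracted) and concept labels. It is not the authors' implementation.

```python
# Hypothetical sketch (not the authors' implementation): MFCC frames ->
# k-means codebook -> bag-of-audio-words histogram -> SVM concept classifier.
import numpy as np
import librosa
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    """One MFCC vector per frame for the (already extracted) audio track."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)

def bag_of_audio_words(frames, codebook):
    """Quantize frames against the codebook, return a normalized histogram."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# train_paths / y_train are placeholders for training audio files and the
# binary labels of one concept; one SVM would be trained per concept.
train_frames = np.vstack([mfcc_frames(p) for p in train_paths])
codebook = MiniBatchKMeans(n_clusters=1024, random_state=0).fit(train_frames)
X_train = np.array([bag_of_audio_words(mfcc_frames(p), codebook) for p in train_paths])
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
```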

... Automatically categorizing videos into concepts, such as people's actions, events, etc., is an important topic. Recently, many approaches have been proposed to tackle the problem of concept learning [2,33,70,80,99,111]. ...
Preprint
Full-text available
With the advancement of deep learning in computer vision, systems are now able to analyze an unprecedented amount of rich visual information from videos to enable applications such as autonomous driving, socially-aware robot assistants and public safety monitoring. Deciphering human behaviors from videos to predict their future paths/trajectories and actions is important in these applications. However, modern vision systems in self-driving applications usually perform the detection (perception) and prediction tasks in separate components, which leads to error propagation and sub-optimal performance. More importantly, these systems do not provide high-level semantic attributes to reason about pedestrian futures. This design hinders prediction performance on video data from diverse domains and unseen scenarios. To enable optimal forecasting of future human behavior, it is crucial for the system to detect and analyze the human activities leading up to the prediction period, passing informative features to the subsequent prediction module for context understanding. In this thesis, with the goal of improving the performance and generalization ability of future trajectory and action prediction models, we conduct human action analysis and jointly optimize models for action detection, action prediction and trajectory prediction.
... In this paper, we build the shooter localization system to solve this problem, which utilizes machine learning models like video synchronization [12] for event reconstruction [1,8,11] and gunshot detection [15,19,20] to help analysts quickly find relevant video segments to look at. We present how our system can geo-localize the shooter given a few video recordings that only capture the gunshot sound, as shown in Figure 1. ...
Conference Paper
Full-text available
Nowadays a huge number of user-generated videos are uploaded to social media every second, capturing glimpses of events all over the world. These videos provide important and useful information for reconstructing events like the Las Vegas Shooting in 2017. In this paper, we describe a system that can localize the shooter based only on a couple of user-generated videos that capture the gunshot sound. Our system first utilizes established video analysis techniques, such as video synchronization and gunshot temporal localization, to organize the unstructured social media videos so that users can understand the event effectively. By combining multimodal information from visual, audio and geo-location sources, our system can then visualize all possible locations of the shooter on a map. Our system provides a web interface for human-in-the-loop verification to ensure accurate estimations. We present the results of estimating the shooter's location in the Las Vegas Shooting of 2017 and show that our system is able to obtain an accurate location using only the first few gunshots. The full technical report and all relevant source code, including the web interface and machine learning models, are available.
... As we see, the overlapping area of all estimations points to the shooter's actual location, the north wing of the Mandalay Bay Hotel. ... semantic concept detection in videos [1,2,3,4,5,6,7], video captioning [8,9], intelligent question answering [10,11,12], 3D event reconstruction [13,14] and activity prediction [15]; but it may also help our society achieve a more unbiased understanding of the event truth [16,17]. ...
Preprint
Full-text available
Nowadays a huge number of user-generated videos are uploaded to social media every second, capturing glimpses of events all over the world. These videos provide important and useful information for reconstructing the events. In this paper, we describe the DAISY system, enabled by established machine learning techniques and physics models, which can localize the shooter location based only on a couple of user-generated videos that capture the gunshot sound. The DAISY system utilizes machine learning techniques like video synchronization and gunshot temporal localization to organize the unstructured social media videos and quickly localize gunshots in the videos. It provides a web interface for human-in-the-loop verification to ensure accurate estimations. We present the results of estimating the shooter's location in the Las Vegas Shooting of 2017 and show that DAISY is able to obtain an accurate location using only the first few shots. We then point out future directions that can help improve the system and further reduce human labor in the process. We publish all relevant source code, including the web interface and machine learning models, in the hope that such a tool can be of use in helping to preserve life, and so that the research and software engineering community can contribute to making the tool better.
Conference Paper
Full-text available
Nowadays a huge number of user-generated videos are uploaded to social media every second, capturing glimpses of events all over the world. These videos in the wild provide important and useful information for reconstructing events like the Las Vegas Shooting in 2017. In this paper, we describe a system that can localize the shooter location only based on a couple of user-generated videos that capture the gunshot sound. Our system first utilizes established video analysis techniques like video synchronization and automatic gunshot processing to organize the unstructured videos in the wild for users to understand the event effectively. By combining multimodal information from visual, audio and geo-locations, our system can then visualize all possible locations of the shooter in the map. Our system provides a web interface for human-in-the-loop verification to ensure accurate estimations. We present the results of estimating the shooter’s location of the Las Vegas Shooting in 2017 and show that our system is able to get accurate location using only the first few gunshots.
Conference Paper
Learning video concept detectors automatically from big but noisy web data with no additional manual annotations is a novel but challenging area in the multimedia and machine learning communities. A considerable amount of video on the web is associated with rich but noisy contextual information, such as the title and other multi-modal information, which provides weak annotations or labels about the video content. To tackle the problem of large-scale noisy learning, we propose a novel method called Multi-modal WEbly-Labeled Learning (WELL-MM), which builds on state-of-the-art machine learning algorithms inspired by the human learning process. WELL-MM introduces a novel multi-modal approach to incorporate meaningful prior knowledge, called a curriculum, from the noisy web videos. We empirically study the curriculum constructed from the multi-modal features of Internet videos and images. Comprehensive experimental results on FCVID and YFCC100M demonstrate that WELL-MM outperforms state-of-the-art methods by a statistically significant margin on learning concepts from noisy web video data. In addition, the results verify that WELL-MM is robust to the level of noise in the video data. Notably, WELL-MM trained on sufficient noisy web labels is able to achieve better accuracy than supervised learning methods trained on clean, manually labeled data.
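The abstract does not spell out the underlying algorithm, but webly-labeled learning of this kind is commonly built on self-paced curriculum learning. The sketch below illustrates that general idea only, under the assumption of binary 0/1 labels and a per-sample curriculum confidence in [0, 1]; all names are hypothetical and this is not the WELL-MM algorithm itself.

```python
# Hypothetical sketch of self-paced/curriculum learning on noisy web labels.
# This is NOT the WELL-MM algorithm, only the general idea it builds on.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_paced_fit(X, y_noisy, curriculum, lam=0.5, growth=1.3, rounds=5):
    """y_noisy: binary 0/1 web labels; curriculum: per-sample prior
    confidence in the web label (0..1), e.g. derived from title/metadata.
    Assumes both classes stay present in the selected subset."""
    selected = np.ones(len(y_noisy), dtype=bool)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X[selected], y_noisy[selected])
        p = clf.predict_proba(X)[np.arange(len(y_noisy)), y_noisy]
        loss = -np.log(p + 1e-12)
        # Keep samples consistent with the model, biased by the curriculum;
        # the threshold grows so harder samples are admitted gradually.
        new_sel = loss < lam * (0.5 + curriculum)
        if new_sel.sum() >= 10:
            selected = new_sel
        lam *= growth
    return clf
```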
Conference Paper
With the increasing use of audio sensors in user generated content (UGC) collections, semantic concept annotation from video soundtracks has become an important research problem. In this paper, we investigate reducing the semantic gap of the traditional data-driven bag-of-audio-words annotation approach by utilizing the large amount of wild audio data and its rich user tags, from which we propose a new feature representation based on semantic class model distance. We conduct experiments on the data collection from the HUAWEI Accurate and Fast Mobile Video Annotation Grand Challenge 2014. We also fuse the audio-only annotation system with a visual-only system. The experimental results show that our audio-only concept annotation system can detect semantic concepts significantly better than random guessing. The new feature representation achieves annotation performance comparable to the bag-of-audio-words feature, and it provides more semantic interpretation in the output. The experimental results also show that the audio-only system can provide significant complementary information to the visual-only concept annotation system, boosting performance and supporting better interpretation of semantic concepts both visually and acoustically.
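As a hypothetical illustration of a class-model distance representation (the paper's exact models and distance measure are not given here), one could train one Gaussian mixture model per user tag on the external "wild" audio and describe each clip by its negative average log-likelihood under each tag model:

```python
# Hypothetical illustration: one GMM per user tag trained on external
# weakly tagged audio; each clip is represented by its (negative) average
# log-likelihood under every tag model instead of raw codeword counts.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_models(tagged_frames, n_components=8):
    """tagged_frames: dict mapping a tag (e.g. 'crowd', 'music') to an
    (n_frames, n_dims) array of low-level feature frames."""
    return {tag: GaussianMixture(n_components=n_components,
                                 covariance_type="diag",
                                 random_state=0).fit(frames)
            for tag, frames in tagged_frames.items()}

def class_distance_vector(clip_frames, class_models):
    """One dimension per tag: larger value = clip is farther from the tag."""
    tags = sorted(class_models)
    return np.array([-class_models[t].score(clip_frames) for t in tags])
```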
Article
Full-text available
Learning classifiers for many visual concepts is important for image categorization and retrieval. As a classifier tends to misclassify negative examples which are visually similar to positive ones, inclusion of such misclassified and thus relevant negatives should be stressed during learning. User-tagged images are abundant online, but which images are the relevant negatives remains unclear. Sampling negatives at random is the de facto standard in the literature. In this paper, we go beyond random sampling by proposing Negative Bootstrap. Given a visual concept and a few positive examples, the new algorithm iteratively finds relevant negatives. Per iteration, we learn from a small proportion of many user-tagged images, yielding an ensemble of meta classifiers. For efficient classification, we introduce Model Compression such that the classification time is independent of the ensemble size. Compared with the state of the art, we obtain relative gains of 14% and 18% on two present-day benchmarks in terms of mean average precision. For concept search in one million images, model compression reduces the search time from over 20 h to approximately 6 min. The effectiveness and efficiency, without the need of manually labeling any negatives, make negative bootstrap appealing for learning better visual concept classifiers.
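A hypothetical sketch of the iterative relevant-negative selection described above, with scikit-learn linear SVMs standing in for the meta classifiers and model compression omitted; `X_pos` and `X_pool` are placeholder feature arrays for the positives and the user-tagged (assumed negative) images:

```python
# Hypothetical sketch of iterative relevant-negative selection; the paper's
# model compression step is omitted.
import numpy as np
from sklearn.svm import LinearSVC

def negative_bootstrap(X_pos, X_pool, iterations=5, n_select=100, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(iterations):
        # Random candidate pool drawn from the user-tagged images.
        idx = rng.choice(len(X_pool), size=min(10 * n_select, len(X_pool)),
                         replace=False)
        cand = X_pool[idx]
        if ensemble:
            # Relevant negatives: candidates the current ensemble scores
            # most positively, i.e. the most confusing ones.
            scores = np.mean([c.decision_function(cand) for c in ensemble], axis=0)
            cand = cand[np.argsort(-scores)[:n_select]]
        else:
            cand = cand[:n_select]
        X = np.vstack([X_pos, cand])
        y = np.r_[np.ones(len(X_pos), dtype=int), np.zeros(len(cand), dtype=int)]
        ensemble.append(LinearSVC().fit(X, y))   # one meta classifier per round
    return ensemble
```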
Article
Full-text available
Most multimedia surveillance and monitoring systems nowadays utilize multiple types of sensors to detect events of interest as and when they occur in the environment. However, due to the asynchrony among, and diversity of, the sensors, information assimilation (how to combine the information obtained from asynchronous and multifarious sources) is an important and challenging research problem. In this paper, we propose a framework for information assimilation that addresses the issues of "when", "what" and "how" to assimilate the information obtained from different media sources in order to detect events in multimedia surveillance systems. The proposed framework adopts a hierarchical probabilistic assimilation approach to detect atomic and compound events. To detect an event, our framework uses not only the media streams available at the current instant but also two of their important properties: first, the accumulated past history of whether they have been providing concurring or contradictory evidence, and second, the system designer's confidence in them. The experimental results show the utility of the proposed framework.
Article
Full-text available
Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design, our approach simultaneously exploits each sample's local structure as well as sample relevance, density, and diversity information, and makes use of both labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least squares model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account the sample relevance, density, and diversity information, as well as sample efficacy in minimizing the parameter variance of the proposed local learning regularized least squares model. We compare the performance of our approach with state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark and report superior performance for the proposed approach.
Conference Paper
Full-text available
Most existing systems for content-based video retrieval (CBVR) now support automatic low-level video content analysis and feature extraction, but they have limited effectiveness from a user's perspective. To support semantic video retrieval via keywords, we propose a novel framework that incorporates a concept ontology to enable more effective modeling and representation of semantic video concepts. Specifically, this framework includes: (a) using salient objects to achieve a middle-level understanding of the semantics of video content; (b) building a domain-dependent concept ontology to enable multi-level modeling and representation of semantic video concepts; and (c) developing a multi-task boosting technique to achieve hierarchical video classifier training for automatic multi-level video annotation. Experimental results in the domain of medical education videos are also provided.
Article
Full-text available
In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives, which are mostly concept-based. Therefore, central to our discussion is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human–computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as those carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities.
Article
Full-text available
The acoustic environment provides a rich source of information on the types of activity, communication modes, and people involved in many situations. It can be accurately classified using recordings from microphones commonly found in PDAs and other consumer devices. We describe a prototype HMM-based acoustic environment classifier incorporating an adaptive learning mechanism and a hierarchical classification model. Experimental results show that we can accurately classify a wide variety of everyday environments. We also show good results classifying single sounds, although classification accuracy is influenced by the granularity of the classification.
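A minimal sketch of the general HMM-per-environment approach, assuming MFCC-like feature sequences and using hmmlearn; the paper's adaptive learning mechanism and hierarchical classification model are not reproduced here:

```python
# Minimal sketch: one Gaussian HMM per environment class, maximum-likelihood
# decision at test time (hmmlearn). The adaptive learning and hierarchical
# parts of the paper are not modeled.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_environment_hmms(sequences_by_class, n_states=4):
    """sequences_by_class: dict mapping class name to a list of
    (n_frames, n_dims) feature sequences (e.g. MFCCs)."""
    models = {}
    for name, seqs in sequences_by_class.items():
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        models[name] = GaussianHMM(n_components=n_states,
                                   covariance_type="diag",
                                   n_iter=20).fit(X, lengths)
    return models

def classify_environment(models, seq):
    """Pick the environment whose HMM assigns the sequence the highest
    log-likelihood."""
    return max(models, key=lambda name: models[name].score(seq))
```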
Article
Full-text available
The TREC Video Retrieval Evaluation (TRECVID) 2009 was a TREC-style video analysis and retrieval evaluation, the goal of which was to promote progress in content-based exploitation of digital video via open, metrics-based evaluation. Over the last 9 years TRECVID has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. 63 teams from various research organizations — 28 from Europe, 24 from Asia, 10 from North America, and 1 from Africa — completed one or more of four tasks: high-level feature extraction, search (fully automatic, manually assisted, or interactive), copy detection, or surveillance event detection. This paper gives an overview of the tasks, the data used, the evaluation mechanisms, and the performance results.
Article
Full-text available
As increasingly powerful techniques emerge for machine tagging multimedia content, it becomes ever more important to standardize the underlying vocabularies. Doing so provides interoperability and lets the multimedia community focus ongoing research on a well-defined set of semantics. This paper describes a collaborative effort of multimedia researchers, library scientists, and end users to develop a large standardized taxonomy for describing broadcast news video. The large-scale concept ontology for multimedia (LSCOM) is the first of its kind designed to simultaneously optimize utility to facilitate end-user access, cover a large semantic space, make automated extraction feasible, and increase observability in diverse broadcast news video data sets.
Article
The IBM Research/Columbia team investigated a novel range of low-level and high-level features and their combination for the TRECVID Multimedia Event Detection (MED) task. We submitted four runs exploring various methods of extraction, modeling and fusing of low-level features and hundreds of high-level semantic concepts. Our Run 1 developed event detection models utilizing Support Vector Machines (SVMs) trained from a large number of low-level features and was interesting in establishing the baseline performance for visual features from static video frames. Run 2 trained SVMs from classification scores generated by 780 visual, 113 action and 56 audio high-level semantic classifiers and explored various temporal aggregation techniques. Run 2 was interesting in assessing performance based on different kinds of high-level semantic information. Run 3 fused the low- and high-level feature information and was interesting in providing insight into the complementarity of this information for detecting events. Run 4 fused all of these methods and explored a novel Scene Alignment Model (SAM) algorithm that utilized temporal information discretized by scene changes in the video.
Article
This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, viability of automatic detection and of annotator labeling, and sufficiency of representation in available video collections. A set of 1873 videos from real users has been annotated with these concepts. Starting with a basic representation of each video clip as a sequence of mel-frequency cepstral coefficient (MFCC) frames, we experiment with three clip-level representations: single Gaussian modeling, Gaussian mixture modeling, and probabilistic latent semantic analysis of a Gaussian component histogram. Using such summary features, we produce support vector machine (SVM) classifiers based on the Kullback-Leibler, Bhattacharyya, or Mahalanobis distance measures. Quantitative evaluation shows that our approaches are effective for detecting interesting concepts in a large collection of real-world consumer video clips.
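A hypothetical sketch of one of the clip-level representations described above: each clip summarized by a single diagonal-covariance Gaussian over its MFCC frames, with a symmetrized KL divergence exponentiated into a precomputed SVM kernel. The gamma value and variable names are placeholders, not the paper's settings:

```python
# Hypothetical sketch: single diagonal Gaussian per clip, symmetrized KL
# divergence turned into a precomputed SVM kernel via exp(-gamma * D).
import numpy as np
from sklearn.svm import SVC

def clip_gaussian(mfcc_frames, eps=1e-6):
    """Summarize a clip by the per-dimension mean and variance of its frames."""
    return mfcc_frames.mean(axis=0), mfcc_frames.var(axis=0) + eps

def sym_kl(g1, g2):
    """Symmetrized KL divergence between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    return 0.5 * np.sum(v1 / v2 + v2 / v1 - 2.0
                        + (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2))

def kl_kernel(gauss_a, gauss_b, gamma=0.01):
    return np.array([[np.exp(-gamma * sym_kl(ga, gb)) for gb in gauss_b]
                     for ga in gauss_a])

# Usage (train_gauss / test_gauss / y_train are placeholders):
# clf = SVC(kernel="precomputed").fit(kl_kernel(train_gauss, train_gauss), y_train)
# scores = clf.decision_function(kl_kernel(test_gauss, train_gauss))
```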
Article
Text categorization is the task of assigning predefined categories to natural language text. With the widely used 'bag of words' representation, previous research usually assigns a word values such as whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, and include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited via a TF-IDF style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. In contrast to using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
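A hypothetical sketch of the two distributional features named above, using one plausible definition of compactness (how tightly the word's occurrence positions cluster around their centroid), both normalized by document length; the paper's exact formulas may differ:

```python
# Hypothetical sketch of the two distributional features: first-appearance
# position and a plausible compactness measure, normalized by document length.
import numpy as np

def distributional_features(tokens, word):
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        return {"first_appearance": 1.0, "compactness": 0.0}
    n = float(len(tokens))
    first = positions[0] / n                      # earlier appearance -> smaller
    spread = np.mean(np.abs(np.array(positions) - np.mean(positions))) / n
    return {"first_appearance": first,
            "compactness": 1.0 - spread}          # tighter clustering -> closer to 1

tokens = "the cat sat on the mat while the cat slept".split()
print(distributional_features(tokens, "cat"))
```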
Conference Paper
We report on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. We have examined 13 features intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. We provide extensive data on system performance and the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier classifies with 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of sound.
Conference Paper
Straightforward classification using kernelized SVMs requires evaluating the kernel for a test vector and each of the support vectors. For a class of kernels we show that one can do this much more efficiently. In particular we show that one can build histogram intersection kernel SVMs (IKSVMs) with runtime complexity of the classifier logarithmic in the number of support vectors as opposed to linear for the standard approach. We further show that by precomputing auxiliary tables we can construct an approximate classifier with constant runtime and space requirements, independent of the number of support vectors, with negligible loss in classification accuracy on various tasks. This approximation also applies to the χ² kernel and other kernels of similar form. We also introduce novel features based on multi-level histograms of oriented edge energy and present experiments on various detection datasets. On the INRIA pedestrian dataset an approximate IKSVM classifier based on these features has the current best performance, with a miss rate 13% lower at 10⁻⁶ false positives per window than the linear SVM detector of Dalal & Triggs. On the DaimlerChrysler pedestrian dataset IKSVM gives comparable accuracy to the best results (based on a quadratic SVM), while being 15× faster. In these experiments our approximate IKSVM is up to 2000× faster than a standard implementation and requires 200× less memory. Finally we show that a 50× speedup is possible using approximate IKSVM based on spatial pyramid features on the Caltech 101 dataset with negligible loss of accuracy.
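A minimal sketch of the histogram intersection kernel plugged into a precomputed-kernel SVM. The paper's contribution, evaluating the resulting classifier in logarithmic or constant time via per-dimension sorting and lookup tables, is only noted in the comments and not implemented here:

```python
# Minimal sketch: histogram intersection kernel with a precomputed-kernel SVM.
# The fast evaluation from the paper (per-dimension sorting of support-vector
# values plus precomputed lookup tables, giving roughly logarithmic or constant
# time per dimension) is not shown.
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """K[i, j] = sum_d min(A[i, d], B[j, d]) for nonnegative histograms."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# Usage (X_train / y_train / X_test are placeholder L1-normalized histograms):
# clf = SVC(kernel="precomputed").fit(intersection_kernel(X_train, X_train), y_train)
# scores = clf.decision_function(intersection_kernel(X_test, X_train))
```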
In this paper, we present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the scalability and performance of our system on a dataset of over 1 million images crawled from the photo-sharing site, Flickr [3], using Oxford landmarks as queries. Building an image-feature vocabulary is a major time and performance bottleneck, due to the size of our dataset. To address this problem we compare different scalable methods for building a vocabulary and introduce a novel quantization method based on randomized trees which we show outperforms the current state-of-the-art on an extensive ground-truth. Our experiments show that the quantization has a major effect on retrieval quality. To further improve query performance, we add an efficient spatial verification stage to re-rank the results returned from our bag-of-words model and show that this consistently improves search quality, though by less of a margin when the visual vocabulary is large. We view this work as a promising step towards much larger, "web-scale" image corpora.
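A hypothetical sketch of the underlying bag-of-visual-words retrieval idea (quantize local descriptors into visual words, weight by tf-idf, rank by cosine similarity); the paper's randomized-tree vocabulary construction and spatial verification re-ranking are omitted, and `desc_per_image` is a placeholder list of per-image descriptor arrays:

```python
# Hypothetical sketch of bag-of-visual-words retrieval with tf-idf weighting
# and cosine ranking; the randomized-tree vocabulary and spatial verification
# re-ranking from the paper are omitted.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bow_vector(descriptors, vocab):
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=vocab.n_clusters).astype(float)

def build_index(desc_per_image, n_words=1000):       # far larger in practice
    vocab = MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(
        np.vstack(desc_per_image))
    tf = np.array([bow_vector(d, vocab) for d in desc_per_image])
    idf = np.log(len(tf) / (1.0 + (tf > 0).sum(axis=0)))
    index = tf * idf
    index /= np.linalg.norm(index, axis=1, keepdims=True) + 1e-12
    return vocab, idf, index

def query(descriptors, vocab, idf, index):
    q = bow_vector(descriptors, vocab) * idf
    q /= np.linalg.norm(q) + 1e-12
    return np.argsort(-(index @ q))                   # ranked image ids, best first
```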
Conference Paper
In this paper we present a systematic study of automatic classification of consumer videos into a large set of diverse semantic concept classes, which have been carefully selected based on user studies and extensively annotated over 1300+ videos from real users. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) in consumer video classification and to discover new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. Three multi-modal fusion frameworks (ensemble, context fusion, and joint boosting) are also evaluated. Experiment results show that visual and audio models perform best for different sets of concepts. Both provide significant contributions to multimodal fusion, via expansion of the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal models are shown to significantly reduce the detection errors (compared to single modality models), resulting in a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first work on systematic investigation of multimodal classification using a large-scale ontology and realistic video corpus.
Conference Paper
A multi-layered, hierarchical framework for generic sports event analysis is proposed in this paper. Each layer uses different features, such as short-time audio energy, short-time zero-crossing rate, color histogram, fractal dimension, and motion, and different classifiers, such as hidden Markov models and threshold-based classifiers. We demonstrate our approach on sports video. Automated semantic concept extraction is required for applications such as highlight generation, indexing, and retrieval.
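A minimal sketch of two of the low-level audio features listed above, short-time energy and short-time zero-crossing rate, computed over fixed-length frames of a mono signal; the frame and hop sizes are placeholders:

```python
# Minimal sketch of two of the listed low-level audio features, computed over
# fixed-length frames of a mono signal x (a 1-D numpy array of samples).
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    return np.mean(frames ** 2, axis=1)               # mean squared amplitude

def zero_crossing_rate(frames):
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)
```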
Conference Paper
In previous research on text categorization, a word is usually described by features which express whether the word appears in the document or how frequently it appears. Although these features are useful, they do not fully express the information contained in the document. In this paper, distributional features are used to describe a word, which express the distribution of the word in a document. In detail, the compactness of the appearances of the word and the position of its first appearance are characterized as features. These features are exploited via a TF-IDF style equation in this paper. Experiments show that the distributional features are useful for text categorization. In contrast to using the traditional term frequency features alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved.
Conference Paper
We describe a technique which is successful at discriminating speech from music on broadcast FM radio. The computational simplicity of the approach could lend itself to wide application, including the ability to automatically change channels when commercials appear. The algorithm provides the capability to robustly distinguish the two classes and runs easily in real time. Experimental results to date show performance approaching 98% correct classification.
Article
The aim of this paper is to investigate the feasibility of an audio-based context recognition system. Here, context recognition refers to the automatic classification of the context or an environment around a device. A system is developed and compared to the accuracy of human listeners in the same task. Particular emphasis is placed on the computational complexity of the methods, since the application is of particular interest in resource-constrained portable devices. Simplistic low-dimensional feature vectors are evaluated against more standard spectral features. Using discriminative training, competitive recognition accuracies are achieved with very low-order hidden Markov models (1-3 Gaussian components). Slight improvement in recognition accuracy is observed when linear data-driven feature transformations are applied to mel-cepstral features. The recognition rate of the system as a function of the test sequence length appears to converge only after about 30 to 60 s. Some degree of accuracy can be achieved even with less than 1-s test sequence lengths. The average reaction time of the human listeners was 14 s, i.e., somewhat smaller, but of the same order as that of the system. The average recognition accuracy of the system was 58% against 69%, obtained in the listening tests in recognizing between 24 everyday contexts. The accuracies in recognizing six high-level classes were 82% for the system and 88% for the subjects.
Article
Many audio and multimedia applications would benefit from the ability to classify and search for audio based on its characteristics. The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features. This lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features, or by selecting or entering reference sounds and asking the engine to retrieve similar or dissimilar sounds.
Article
A hybrid connectionist-HMM speech recognizer uses a neural network acoustic classifier. This network estimates the posterior probability that the acoustic feature vectors at the current time step should be labelled as each of around 50 phone classes. We sought to exploit informal observations of the distinctions in this posterior domain between nonspeech audio and speech segments well-modeled by the network. We describe four statistics that successfully capture these differences, and which can be combined to make a reliable speech/nonspeech categorization that is closely related to the likely performance of the speech recognizer. We test these features on a database of speech/music examples, and our results match the previously-reported classification error, based on a variety of special-purpose features, of 1.4% for 2.5 second segments. We also show that recognizing segments ordered according to their resemblance to clean speech can result in an error rate close to the ideal minimum o...
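The four statistics are not named in the abstract; as a hedged illustration only, the sketch below computes two statistics commonly derived from per-frame phone posteriors for this purpose, mean per-frame entropy and mean frame-to-frame posterior change ("dynamism"):

```python
# Hedged illustration only: two statistics often computed from per-frame
# phone posteriors to separate speech from non-speech (the abstract's four
# statistics are not named). post is an (n_frames, n_phones) array.
import numpy as np

def posterior_statistics(post, eps=1e-12):
    entropy = -np.sum(post * np.log(post + eps), axis=1)    # per-frame entropy
    dynamism = np.sum(np.diff(post, axis=0) ** 2, axis=1)   # frame-to-frame change
    return {"mean_entropy": float(entropy.mean()),
            "mean_dynamism": float(dynamism.mean())}
```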
Q. Jin, F. Schulam, S. Rawat, S. Burger, D. Ding, F. Metze. Categorizing Consumer Videos Using Audio.
C. Snoek, M. Worring. Concept-based Video Retrieval. Foundations and Trends in Information Retrieval.