Conference Paper

Expressive semantics for automatic annotation and retrieval of video streams


Abstract

For the practical use of video retrieval systems, access must be provided in a way that bridges the semantic gap between the system and its users. Semiotics, which is concerned with the production of sense and the way in which it is received by humans, appears as the formal background for this goal. According to semiotics, semantics can be extracted at different levels of signification through a suitable set of rules that combine visual and auditory signs. An intermediate semantic level, which has to do with the combination of low-level signs such as color, motion, and shape and their changes through time, accounts for the video's expressiveness and the emotions it provokes. It can be obtained automatically and is often useful for retrieval by content and similarity-based classification of video streams. Applications to the retrieval of commercials based on expressive semantics are discussed.
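The rule layer described in the abstract, which combines low-level signs into an intermediate expressive level, can be gestured at with a minimal sketch. The feature names, thresholds, and decision logic below are purely illustrative assumptions, not the paper's actual rules; only the semiotic class labels (practical, playful, utopic, critical) come from the related literature.

```python
# Illustrative sketch: hand-written rules mapping low-level signs of a
# video segment (color saturation, motion, editing rhythm) to a semiotic
# label. All feature names and thresholds are hypothetical.

def expressive_class(features):
    """Map low-level descriptors of a video segment to a semiotic label."""
    saturation = features["mean_saturation"]   # 0..1
    motion = features["motion_magnitude"]      # 0..1, normalized
    cut_rate = features["cuts_per_minute"]

    # Fast cutting plus strong motion reads as energetic / "playful".
    if cut_rate > 20 and motion > 0.5:
        return "playful"
    # Saturated, slow-moving imagery reads as contemplative / "utopic".
    if saturation > 0.6 and motion < 0.2:
        return "utopic"
    return "practical"

print(expressive_class({"mean_saturation": 0.8,
                        "motion_magnitude": 0.1,
                        "cuts_per_minute": 4}))  # → utopic
```

In a real system such rules would be induced from annotated data rather than written by hand, as several of the citing papers below discuss.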


... The extracted features can be combined and weighted in different ways, resulting in logical (advanced) features, which represent the image content at a higher abstraction level. New work in this area considers image semantics and tries to define and integrate different kinds of emotions [5]. The degree of similarity between a query image and the target images is determined by calculating a distance (for example, the Euclidean distance L2) between the corresponding features. ...
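The similarity computation mentioned in this excerpt is straightforward to sketch. The feature vectors below are made-up placeholders; only the L2 distance and nearest-first ranking follow the text.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, targets):
    """Indices of target vectors, ordered from most to least similar."""
    return sorted(range(len(targets)),
                  key=lambda i: l2_distance(query, targets[i]))

# Hypothetical 3-dimensional feature vectors.
query = [0.2, 0.4, 0.4]
targets = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3], [0.5, 0.5, 0.0]]
print(rank_by_similarity(query, targets))  # → [1, 2, 0]
```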
Conference Paper
Full-text available
... The segmented shots are annotated for subsequent searches (Ito, Sato, & Fukumura, 2000; Wilcox & Boreczky, 1998) and for automatically summarizing the title (Jin & Hauptmann, 2001). Based on the semantic structure analysis, various semantic indexing models or schemes have been proposed (Del Bimbo, 2000; Gao, Ko, & De Silva, 2000; Gargi, Antani, & Kasturi, 1998; Iyengar, Nock, Neti, & Franz, 2002; Jasinschi et al., 2001a; Jasinschi et al., 2002; Luo & Hwang, 2003; Naphade & Huang, 2000; Naphade, Kristjansson, Frey, & Huang, 1998). Most of them employ probabilistic models, for example, Hidden Markov Models (HMMs) and/or Dynamic Bayesian networks (DBNs), to represent the video structure. ...
Article
Full-text available
The rapid technical advance of multimedia communication has enabled more and more people to enjoy videoconferences. Traditionally, a personal videoconference is either not recorded or recorded only as ordinary audio and video files that allow linear access alone. Moreover, in addition to the video and audio channels, other videoconferencing channels, including text chat, file transfer, and whiteboard, also contain valuable information. It is therefore not convenient to search or recall the content of a videoconference from the archives. However, there exists little research on the management and automatic indexing of personal videoconferences. The existing methods for video indexing, lecture indexing, and meeting support systems cannot be applied to personal videoconferences straightforwardly. This chapter discusses issues unique to personal videoconferencing and proposes a comprehensive framework for indexing personal videoconferences. The framework consists of three modules: a videoconference archive acquisition module, a videoconference archive indexing module, and an indexed videoconference accessing module. This chapter elaborates on the design principles and implementation methodologies of each module, as well as the ...
... Chang et al. [8] presented semantic visual templates, using examples of the sunset and high-jumper concepts. Del Bimbo [9] introduced a detection scheme for four semiotic classes (practical, playful, utopic, and critical) that measures the expressiveness of commercial videos, and applied it to the retrieval of advertisement videos. IBM has developed statistical models for over ten concepts [10] for its Video TREC retrieval system. ...
Conference Paper
Full-text available
In this paper we describe new methods to detect semantic concepts from digital video based on audible and visual content. Temporal Gradient Correlogram captures temporal correlations of gradient edge directions from sampled shot frames. Power-related physical features are extracted from short audio samples in video shots. Video shots containing people, cityscape, landscape, speech or instrumental sound are detected with trained self-organized maps and kNN classification results of audio samples. Test runs and evaluations in the TREC 2002 Video Track show consistent performance for Temporal Gradient Correlogram and state-of-the-art precision in audio-based instrumental sound detection.
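The kNN classification step named in this abstract can be illustrated generically. This is not the paper's actual pipeline (which also uses self-organized maps); the feature vectors, labels, and k value below are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(sample, train, k=3):
    """Classify a feature vector by majority vote among the k nearest
    labelled training vectors (Euclidean distance)."""
    nearest = sorted(train, key=lambda lf: math.dist(sample, lf[0]))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D audio feature vectors with concept labels.
train = [([0.0, 0.0], "speech"), ([0.1, 0.0], "speech"),
         ([0.05, 0.1], "speech"),
         ([1.0, 1.0], "instrumental"), ([0.9, 1.1], "instrumental")]
print(knn_predict([0.05, 0.05], train))  # → speech
```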
... Naphade et al. [6] have proposed a framework of probabilistic multimedia objects for modelling audiovisual semantics from low-level features and demonstrated it with sky, water, forest, rocks and snow concepts [7]. Other similar approaches have been reported in [8][9]. Naphade's multimedia objects have been incorporated into multimedia retrieval system of IBM [10]. ...
Conference Paper
Full-text available
This paper describes revised content-based search experiments in the context of the TRECVID 2003 benchmark. The experiments focus on measuring content-based video retrieval performance with the following search cues: visual features, semantic concepts, and text. The fusion of features uses weights and similarity ranks. Visual similarity is computed using Temporal Gradient Correlogram and Temporal Color Correlogram features that are extracted from the dynamic content of a video shot. Automatic speech recognition transcripts and concept detectors enable higher-level semantic searching. 60 hours of news videos from the TRECVID 2003 search task were used in the experiments. System performance was evaluated on 25 pre-defined search topics using average precision. In visual search, multiple examples improved the results over single-example search. Weighted fusion of text, concept, and visual features improved the performance over the text search baseline. An expanded query term list for text queries also gave a notable increase in performance over the baseline text search.
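The weighted fusion of similarity ranks mentioned above can be sketched as follows. The feature names, rank lists, and weights are illustrative assumptions; only the idea of combining per-feature ranks with weights comes from the abstract.

```python
def fuse_ranks(rank_lists, weights):
    """Weighted rank fusion: each document's fused score is the
    weighted sum of its per-feature ranks (lower score = better).
    rank_lists: dict feature_name -> {doc_id: rank (1 = best)}."""
    fused = {}
    for name, ranks in rank_lists.items():
        w = weights[name]
        for doc, r in ranks.items():
            fused[doc] = fused.get(doc, 0.0) + w * r
    return sorted(fused, key=fused.get)

# Hypothetical per-feature rankings of three shots.
ranks = {"text":   {"a": 1, "b": 2, "c": 3},
         "visual": {"a": 3, "b": 1, "c": 2}}
print(fuse_ranks(ranks, {"text": 0.7, "visual": 0.3}))  # → ['a', 'b', 'c']
```

Shifting weight toward the visual cue reorders the result, which is the sort of trade-off the paper's pre-validation of fusion configurations explores.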
... New work in this area considers image semantics and tries to define and integrate different kinds of emotions (e.g. [8]). The degree of similarity between a query image and the target images is determined by calculating a distance (for example, the Euclidean distance L2) between the corresponding features. ...
Conference Paper
This paper presents an overview of parallel architectures for the efficient realisation of digital libraries, using image databases as an example. The state-of-the-art approach to image retrieval uses a priori extracted features and limits the applicability of retrieval techniques, as a detailed search for objects and for other important elements cannot be performed. Well-suited algorithms for dynamic feature extraction and comparison are rarely applied, as they require huge computational and memory resources. Integrating parallel methods and architectures enables the use of these alternative approaches for improved classification and retrieval of documents in digital libraries. Prototypes implemented on a symmetric multiprocessor (SMP) and on a cluster architecture are therefore introduced in the paper. Performance measurements with a wavelet-based template matching method resulted in a reasonable speedup.
... [4,5] The question arises how supervision and automation (supervised learning) of such categorisation systems can come about on the basis of textual annotations made by an ensemble of human experts using a specific domain ontology [25]. Note that the ensemble of multimedia objects including annotations also possesses natural statistics and an according geometry and topology that is not yet imposed as contextual information or knowledge on multimedia-consistent scale-space schemes. These natural statistics and geometry actually influence a human expert in defining a specific domain ontology. ...
Conference Paper
Full-text available
Static multimedia on the Web can already hardly be structured manually. Although unavoidable and necessary, manual annotation of dynamic multimedia becomes even less feasible as multimedia quickly grows in complexity, i.e. in volume, modality, and usage context. The latter context could be set by learning or other purposes of the multimedia material. This multimedia dynamics calls for categorisation systems that index, query, and retrieve multimedia objects on the fly in a similar way as a human expert would. We present and demonstrate such a supervised dynamic multimedia object categorisation system. Our categorisation system comes about by continuously gauging it against a group of human experts who annotate raw multimedia for a certain domain ontology given a usage context. Thus our system effectively learns the categorisation behaviour of human experts. By inducing supervised multi-modal content- and context-dependent potentials, our categorisation system associates field strengths of raw dynamic multimedia object categorisations with those human experts would assign. After a sufficiently long period of supervised machine learning we arrive at automated, robust, and discriminative multimedia categorisation. We demonstrate the usefulness and effectiveness of our multimedia categorisation system in retrieving semantically meaningful soccer-video fragments, in particular by taking advantage of multimodal and domain-specific information and knowledge supplied by human experts.
... It is rather a complex instantiation of static and dynamic elements emerging from relations within the system: the database record itself, the temporal context, and the user's algorithms. In [6], [7] several different low-level visual primitives are combined by domain-specific rules in order to capture the semantics of video content at a higher level of significance. ...
Article
Full-text available
A generic system for automatic annotation of videos is introduced. The proposed approach is based on the premise that the rules needed to infer a set of high-level concepts from low-level descriptors cannot be defined a priori. Rather, knowledge embedded in the database and interaction with an expert user is exploited to enable system learning. Underpinning the system at the implementation level is preannotated data that dynamically creates signification links between a set of low-level features extracted directly from the video dataset and high-level semantic concepts defined in the lexicon. The lexicon may consist of words, icons, or any set of symbols that convey the meaning to the user. Thus, the lexicon is contingent on the user, application, time, and the entire context of the annotation process. The main system modules use fuzzy logic and rule mining techniques to approximate human-like reasoning. A rule-knowledge base is created on a small sample selected by the expert user during the learning phase. Using this rule-knowledge base, the system automatically assigns keywords from the lexicon to nonannotated video clips in the database. Using common low-level video representations, the system performance was assessed on a database containing hundreds of broadcasting videos. The experimental evaluation showed robust and high annotation accuracy. The system architecture offers straightforward expansion to relevance feedback and autonomous learning capabilities.
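The fuzzy rule firing described in this abstract might look like the following minimal sketch. The membership functions, feature names, thresholds, and keyword lexicon are all hypothetical; only the pattern (fuzzy rules over low-level features assigning lexicon keywords) follows the text.

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership function rising from a, peaking at b,
    falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def annotate(shot, rules, threshold=0.5):
    """Fire each fuzzy rule (min as the AND operator) and return the
    keywords whose firing strength reaches the threshold."""
    keywords = []
    for keyword, conditions in rules:
        strength = min(tri(shot[f], *params) for f, params in conditions)
        if strength >= threshold:
            keywords.append(keyword)
    return keywords

# Hypothetical rule: high motion AND high cut rate -> "action".
rules = [("action", [("motion", (0.4, 0.8, 1.0)),
                     ("cut_rate", (0.3, 0.7, 1.0))])]
print(annotate({"motion": 0.8, "cut_rate": 0.7}, rules))  # → ['action']
```

In the system described above, such a rule-knowledge base would be mined from expert-annotated samples rather than written by hand.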
Article
Multimedia information retrieval has become an important issue in library work, particularly since users can have access to the Internet both at their homes and in libraries. Besides bibliographic information, pictorial resources are nowadays requested to meet the new and demanding users' needs. Research on different aspects of multimedia information retrieval can improve practice in different ways and findings can be beneficial to many professionals: cataloguers, indexers, reference librarians, Web designers, chief librarians, and administrators. This paper presents a critical account of the developments in multimedia information retrieval research and practice in Italy in the last ten years. The paper also tries to determine if dissemination of research findings has been critical in establishing new approaches to practice. The study examined the LIS sector, analysing projects through relevant publications and implementations, as well as computer science and other disciplines, to trace future developments and trends.
Conference Paper
Full-text available
This study describes experiments on automatic detection of semantic concepts, which are textual descriptions of digital video content. The concepts can be further used in content-based categorization and access of digital video repositories. Temporal gradient correlogram, temporal color correlogram, and motion activity low-level features are extracted from the dynamic visual content of a video shot. Semantic concepts are detected with an expeditious method that is based on the selection of small positive example sets and computational low-level feature similarities between video shots. Detectors using several feature and fusion operator configurations are tested on a 60-hour news video database from the TRECVID 2003 benchmark. Results show that feature fusion based on ranked lists gives better detection performance than fusion of normalized low-level feature space distances. The best performance was obtained by pre-validating the configurations of features and rank fusion operators. Results also show that minimum rank fusion of temporal color and structure provides comparable performance.
Article
In this paper, we examine emerging frontiers in the evolution of content-based retrieval systems that rely on an intelligent infrastructure. Here, we refer to intelligence as the capabilities of the systems to build and maintain situational or world models, utilize dynamic knowledge representations, exploit context, and leverage advanced reasoning and learning capabilities. We argue that these elements are essential to producing effective systems for retrieving audio-visual content at semantic levels matching those of human perception and cognition. In this paper, we review relevant research on the understanding of human intelligence and construction of intelligent systems in the fields of cognitive psychology, artificial intelligence, semiotics, and computer vision. We also discuss how some of the principal ideas from these fields lead to new opportunities and capabilities for content-based retrieval systems. Finally, we describe some of our efforts in these directions. In particular, we present MediaNet, a multimedia knowledge presentation framework, and some MPEG-7 description tools that facilitate and enable intelligent content-based retrieval.
Article
Full-text available
A compositional approach increases the level of representation that can be automatically extracted and used in a visual information retrieval system. Visual information at the perceptual level is aggregated according to a set of rules. These rules reflect the specific context and transform perceptual words into phrases, capturing pictorial content at a higher semantic level, closer to that of humans.
Article
Visual database systems require efficient indexing to facilitate fast access to the images and video sequences in the database. Recently, several content-based indexing methods for image and video based on spatial relationships, color, texture, shape, sketch, object motion, and camera parameters have been reported in the literature. The goal of this paper is to provide a critical survey of existing literature on content-based indexing techniques and to point out the relative advantages and disadvantages of each approach.
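Among the indexing cues surveyed in such work, a quantized colour histogram is one of the simplest. A minimal sketch, with bin count and input format chosen for illustration:

```python
def color_histogram(pixels, bins=4):
    """Quantize RGB pixels (0..255 per channel) into a normalized 3-D
    colour histogram, flattened to a vector of bins**3 entries."""
    hist = [0] * bins ** 3
    for r, g, b in pixels:
        idx = ((r * bins // 256) * bins * bins
               + (g * bins // 256) * bins
               + (b * bins // 256))
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]
```

Histograms like this, compared with a distance measure, support the colour-based indexing the survey discusses; the other cues (texture, shape, motion) follow the same extract-then-compare pattern with different features.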
Conference Paper
Digital video is rapidly becoming important for education, entertainment, and a host of multimedia applications. With the size of video collections growing to thousands of hours, technology is needed to effectively browse segments in a short time without losing the content of the video. We propose a method to extract the significant audio and video information and create a "skim" video which represents a very short synopsis of the original. The goal of this work is to show the utility of integrating language and image understanding techniques for video skimming by extraction of significant information, such as specific objects, audio keywords, and relevant video structure. The resulting skim video is much shorter, where compaction is as high as 20:1, and yet retains the essential content of the original segment.
Article
This paper reviews a number of recently available techniques in content processing of visual media and their application to the indexing, retrieval, relevance assessment, interactive perception, annotation and re-use of visual documents. 1. Background A few years ago, the problems of representation and retrieval of visual documents were confined to specialized image databases (geographical, medical, pilot experiments in computerized slide libraries), in the professional applications of the audiovisual industry (production, broadcasting and archives), and in computerized training or education. The present expansion of multimedia technology and information highways has put content processing of visual documents at the core of key application domains: digital and interactive video, large distributed digital libraries, multimedia publishing. Though the most important investments have been targeted at the information infrastructure (networks, servers, coding and compression, delivery), multimedia systems are...
Article
This paper presents a work based on semiotic studies that includes the extraction of simple visual features from commercials and a statistical analysis of them and their relationships with high-level semantic terms. Well-known algorithms have been implemented and enhanced for feature extraction, as well as a novel probabilistic approach to color naming. The statistical analysis consists of finding correlations between variables, as well as the dimensions in feature space that best explain the variance of the data set. Some interesting conclusions are reached at the end of the work about how commercials are grouped in feature space with respect to different levels of semantics. © 2000 Academic Press
Conference Paper
Effective retrieval and browsing of videos by content is based on high-level information associated with visual data. Automatic extraction of high-level content descriptors requires exploiting the semantic characteristics of video types. In this paper a complete system for content-based annotation and retrieval of news videos is presented; indexing of the video stream is fully automated and is based both on visual features extracted from video shots and on textual strings extracted from captions and speech.
Conference Paper
Content-based browsing and navigation in digital video collections have centered on sequential and linear presentation of images. To facilitate such applications, nonlinear and non-sequential access into video documents is essential, especially for long programs. For many programs, this can be achieved by identifying underlying story structures, which are reflected both by visual content and by the temporal organization of composing elements. A new framework of video analysis and associated techniques are proposed to automatically parse long programs, extract story structures, and identify story units. The proposed analysis and representation contribute to the extraction of scenes and story units, each representing a distinct locale or event, which cannot be achieved by shot boundary detection alone. Analysis is performed on MPEG compressed video and without a priori models. The result is a compact representation that serves as a summary of the story and allows hierarchical organization of video documents.
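Shot boundary detection, which this abstract contrasts with story-unit extraction, is commonly done by thresholding the difference between colour histograms of consecutive frames. A minimal sketch, with the threshold value an illustrative assumption:

```python
def shot_boundaries(frame_hists, threshold=0.5):
    """Declare a cut wherever the L1 distance between consecutive
    frame colour histograms exceeds the threshold. Returns the index
    of the first frame of each new shot."""
    cuts = []
    for i in range(1, len(frame_hists)):
        d = sum(abs(a - b) for a, b in zip(frame_hists[i - 1], frame_hists[i]))
        if d > threshold:
            cuts.append(i)
    return cuts

# Two stable shots with an abrupt cut between frames 1 and 2.
print(shot_boundaries([[1, 0], [1, 0], [0, 1], [0, 1]]))  # → [2]
```

The framework above goes further by grouping the resulting shots into scenes and story units based on visual similarity and temporal organization.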
Article
Abstracting Digital Movies Automatically. Silvia Pfeiffer, Rainer Lienhart, Stephan Fischer and Wolfgang Effelsberg, Praktische Informatik IV, University of Mannheim, L 15, 16, D-68131 Mannheim. Abstract: Large video-on-demand databases consisting of thousands of digital movies are not easy to handle: the user must have an attractive means to retrieve his movie of choice. For analog video, movie trailers are produced to allow a quick preview and perhaps stimulate possible buyers. This paper presents techniques to automatically produce such movie abstracts of digital videos.