Conference Paper

Similarity-based Multi-Modal Lecture Video Indexing and Retrieval with Deep Learning

Article
Full-text available
Spoken document retrieval for a specific context is an active and interesting area of research. It makes it convenient for users to search through archives of speech data, which is not feasible manually as it is very time consuming and expensive. In the current article, we focus on performing this task for political speeches delivered in a variety of environments. The technique used here takes an archive of spoken documents (audio files) as input and performs automatic speech recognition (ASR) on it to derive textual transcripts, using deep neural networks (DNN), hidden Markov models (HMM) and Gaussian mixture models (GMM). These transcriptions are further pruned for indexing by applying certain pre-processing techniques. Thereafter, the system builds a time- and space-efficient index of the documents using wavelet trees for retrieval. The constructed index is searched to count the occurrences of the terms in a user's query. These counts are then used to calculate term frequency-inverse document frequency (TF-IDF) scores, and the similarity score of the query with each document is computed using the cosine similarity measure. Finally, the documents are ranked by these scores in order of relevance. The proposed system thus develops a speech recognition system and introduces a novel indexing scheme, based on wavelet trees, for retrieving data.
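As a rough illustration of the ranking stage described in this abstract, the sketch below computes TF-IDF weights from term counts and ranks documents by cosine similarity. It is a minimal stand-in, not the authors' system: plain Python term counting replaces the wavelet-tree index, and the transcripts are toy data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors from tokenized documents (term counting stands in
    for the paper's wavelet-tree index)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs], idf

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

transcripts = [
    "economy jobs growth economy".split(),
    "health care reform jobs".split(),
]
vecs, idf = tfidf_vectors(transcripts)
query = "economy jobs".split()
qvec = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
ranked = sorted(range(len(vecs)), key=lambda i: cosine(qvec, vecs[i]), reverse=True)
print(ranked)  # document indices in order of relevance
```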
Article
Full-text available
In this article, we present a new multi-lingual Optical Character Recognition (OCR) system for scanned documents. For Latin characters, current open-source systems such as Tesseract provide very high accuracy. However, accuracy on multi-lingual documents that include Asian characters is usually lower than on Latin-only documents. For example, when a document mixes English, Chinese and/or Korean characters, OCR accuracy is lower than for English-only documents because the character/text properties of Chinese and Korean differ substantially from Latin-type characters. To tackle these problems, we propose a new framework using three neural blocks (a segmenter, a switcher, and multiple recognizers) and reinforcement learning of the segmenter: the segmenter partitions a given word image into multiple character images, the switcher assigns a recognizer to each sub-image, and the recognizers perform recognition of their assigned sub-images. Training the recognizers and the switcher can be treated as a traditional image classification task, so we train them with supervised learning. However, supervised learning of the segmenter has two critical drawbacks: its objective function is sub-optimal and its training requires a large amount of annotation effort. Thus, by adopting the REINFORCE algorithm, we train the segmenter to optimize the overall performance, i.e., to minimize the edit distance of the final recognition results. Experimental results show that the proposed method significantly improves performance on multi-lingual scripts and large-character-set languages without using character boundary labels.
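The REINFORCE update for the segmenter can be sketched as follows, assuming a toy policy network that emits independent cut probabilities per position. All module names and shapes are illustrative, and a placeholder reward stands in for the negative edit distance of the final recognition result.

```python
import torch
import torch.nn as nn

class Segmenter(nn.Module):
    """Toy policy: for each of W candidate cut positions in a word image,
    emit an independent Bernoulli probability of cutting there."""
    def __init__(self, feat_dim, width):
        super().__init__()
        self.head = nn.Linear(feat_dim, width)

    def forward(self, feats):
        return torch.sigmoid(self.head(feats))

def reinforce_step(segmenter, feats, reward_fn, optimizer):
    probs = segmenter(feats)                      # (batch, W) cut probabilities
    dist = torch.distributions.Bernoulli(probs)
    cuts = dist.sample()                          # sampled segmentation
    reward = reward_fn(cuts)                      # e.g. -edit_distance(...)
    loss = -(dist.log_prob(cuts).sum(dim=1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()

seg = Segmenter(16, 8)
opt = torch.optim.Adam(seg.parameters(), lr=1e-3)
feats = torch.randn(4, 16)
# toy reward: negative cut count (stands in for the negative edit distance
# of the final recognition after switching and recognition)
reward_fn = lambda cuts: -cuts.sum(dim=1)
reinforce_step(seg, feats, reward_fn, opt)
```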
Article
Full-text available
The use of e-learning management systems (ELMS) in educational institutions is on the rise, and the volume of content in these systems is growing rapidly. Instructional video is one type of content that is inherently large in volume and sequential in access, which makes it difficult to manage and to retrieve information from. There has been extensive research on video information retrieval in the past decade, but existing systems need pre-processing by human intervention, are cost prohibitive, or do not support natural interaction. In this paper, we propose a framework for information retrieval of instructional video content in an ELMS that utilizes Natural Language Understanding/Processing and an Intelligent Agent in a seamlessly integrated environment to address the key issues of the existing solutions.
Article
Full-text available
We present a novel spatio-temporal descriptor to efficiently represent a video object for the purpose of content-based video retrieval. Spatial and temporal features are integrated in a unified framework for the retrieval of similar video shots. A sequence of orthogonal processing, using a pair of 1-D multiscale and multispectral filters, on the space-time volume (STV) of a video object (VOB) produces a gradually evolving (smoother) surface. Zero-crossing contours (2-D) computed using the mean curvature on this evolving surface are stacked in layers to yield a hilly (3-D) surface, giving a joint multispectro-temporal curvature scale space (MST-CSS) representation of the video object. Peaks and valleys (saddle points) are detected on the MST-CSS surface for feature representation and matching. Computing the cost function for matching a query video shot with a model involves matching a pair of 3-D point sets, with their attributes (local curvature), and the 3-D orientations of the finally smoothed STV surfaces. Experiments have been performed with simulated and real-world video shots using the precision-recall metric. The system is compared with a few state-of-the-art methods that use shape and motion trajectory for VOB representation. Our unified approach has shown better performance than approaches that combine match costs obtained from separate shape and motion trajectory representations, and than our previous work on a simple joint spatio-temporal descriptor (3-D-CSS).
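The curvature-scale-space idea underlying MST-CSS can be illustrated in its classic 2-D form: smooth a closed contour at increasing scales and record where the curvature changes sign. This is only a simplified sketch of that ingredient, not the paper's 3-D space-time construction.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def css_zero_crossings(x, y, sigmas):
    """For each smoothing scale, return arc-length positions where the
    curvature of a closed contour changes sign."""
    rows = []
    for s in sigmas:
        xs = gaussian_filter1d(x, s, mode="wrap")
        ys = gaussian_filter1d(y, s, mode="wrap")
        dx, dy = np.gradient(xs), np.gradient(ys)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        kappa = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
        rows.append(np.where(np.diff(np.sign(kappa)) != 0)[0])
    return rows

t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
x = np.cos(t) + 0.3 * np.cos(5 * t)   # wavy closed contour
y = np.sin(t) + 0.3 * np.sin(5 * t)
for s, zc in zip([1, 4, 16], css_zero_crossings(x, y, [1, 4, 16])):
    print(f"sigma={s}: {len(zc)} curvature zero-crossings")  # count shrinks with scale
```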
Article
Full-text available
A critical issue in large-scale multimedia retrieval is how to develop an effective framework for ranking search results. This problem is particularly challenging for content-based video retrieval due to issues such as short text queries, insufficient training samples, the fusion of multimodal contents, and large-scale learning with huge media data. In this paper, we propose a novel multimodal and multilevel (MMML) ranking framework to address the challenging ranking problem of content-based video retrieval. We represent the video retrieval task with graphs and suggest a graph-based semi-supervised ranking (SSR) scheme, which can learn effectively from small samples and smoothly integrate multimodal resources for ranking. To make the semi-supervised ranking solution practical for large-scale retrieval tasks, we propose a multilevel ranking framework that unifies several different ranking approaches in a cascade fashion. We have conducted empirical evaluations of the proposed solution on automatic search tasks using the benchmark testbed of TRECVID 2005. The promising empirical results show that our ranking solutions are effective and very competitive with the state-of-the-art solutions in the TRECVID evaluations.
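To give a concrete flavour of graph-based semi-supervised ranking, the sketch below propagates relevance from a labeled example over an affinity graph, manifold-ranking style. It is a generic stand-in on toy data, not the authors' SSR scheme.

```python
import numpy as np

def ssr_scores(W, labeled, alpha=0.9, iters=50):
    """Graph-based semi-supervised ranking: propagate relevance from a few
    labeled examples over an affinity matrix W (manifold-ranking style)."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # symmetric normalization
    y = np.zeros(len(W)); y[labeled] = 1.0   # initial relevance
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f

# toy affinity over 5 video shots; shot 0 is a known relevant example
W = np.array([[0,1,1,0,0],[1,0,1,0,0],[1,1,0,1,0],[0,0,1,0,1],[0,0,0,1,0]], float)
print(np.argsort(-ssr_scores(W, labeled=[0])))  # shots ranked by relevance
```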
Article
Full-text available
Visual surveillance produces large amounts of video data, so effective indexing and retrieval from surveillance video databases are very important. Although there are many ways to represent the content of video clips in current video retrieval algorithms, a semantic gap between users and retrieval systems still exists. Visual surveillance systems supply a platform for investigating semantic-based video retrieval. In this paper, a semantic-based video retrieval framework for visual surveillance is proposed. A cluster-based tracking algorithm is developed to acquire motion trajectories. The trajectories are then clustered hierarchically, using spatial and temporal information, to learn activity models. A hierarchical structure of semantic indexing and retrieval of object activities, where each individual activity automatically inherits all the semantic descriptions of the activity model to which it belongs, is proposed for accessing video clips and individual objects at the semantic level. The proposed retrieval framework supports various queries, including queries by keywords, multiple-object queries, and queries by sketch. For multiple-object queries, succession and simultaneity restrictions, together with depth-first and breadth-first orderings, are considered. For sketch-based queries, a method for matching trajectories drawn by users to the stored spatial trajectories is proposed. The effectiveness and efficiency of our framework are tested on a crowded traffic scene.
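Trajectory clustering of the kind described here can be sketched by resampling trajectories to a fixed length and clustering them hierarchically. The resampling, distance, and linkage choices below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def resample(traj, n=16):
    """Resample an (m, 2) trajectory to n points so trajectories of
    different lengths become comparable fixed-length vectors."""
    t = np.linspace(0, 1, len(traj))
    ti = np.linspace(0, 1, n)
    return np.column_stack([np.interp(ti, t, traj[:, k]) for k in (0, 1)]).ravel()

# toy motion trajectories of varying length (random walks)
trajs = [np.cumsum(np.random.randn(np.random.randint(20, 40), 2), axis=0)
         for _ in range(30)]
X = np.vstack([resample(tr) for tr in trajs])
Z = linkage(X, method="ward")                    # hierarchical clustering
labels = fcluster(Z, t=4, criterion="maxclust")  # 4 learned activity models
print(labels)
```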
Article
Full-text available
The amount of captured video is growing with the increasing number of video cameras, especially the millions of surveillance cameras that operate 24 hours a day. Since video browsing and retrieval are time consuming, most captured video is never watched or examined. Video synopsis is an effective tool for browsing and indexing such video. It provides a short video representation while preserving the essential activities of the original video. The activity in the video is condensed into a shorter period by simultaneously showing multiple activities, even when they originally occurred at different times. The synopsis video is also an index into the original video, pointing to the original time of each activity. Video synopsis can be applied to create a synopsis of endless video streams, as generated by webcams and surveillance cameras, and can address queries like "Show in one minute the synopsis of this camera broadcast during the past day". The process includes two major phases: (i) an online conversion of the endless video stream into a database of objects and activities (rather than frames); and (ii) a response phase, generating the video synopsis as a response to the user's query.
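The condensation step can be caricatured as packing activity "tubes" into a shorter timeline. The greedy scheduler below is purely illustrative; the paper formulates this differently, as an optimization over object-based activity shifts.

```python
def condense(activities, synopsis_len, max_concurrent=3):
    """Greedily assign each activity (duration in frames) a new start time so
    at most `max_concurrent` activities overlap at any synopsis frame."""
    load = [0] * synopsis_len
    mapping = {}
    for aid, duration in sorted(activities.items(), key=lambda kv: -kv[1]):
        for start in range(synopsis_len - duration + 1):
            if all(load[t] < max_concurrent for t in range(start, start + duration)):
                mapping[aid] = start               # shifted start in the synopsis
                for t in range(start, start + duration):
                    load[t] += 1
                break
    return mapping  # activity id -> start frame; also an index into the source

# activities recorded over a whole day, durations in frames
print(condense({"car_07": 40, "person_12": 25, "dog_03": 25, "bike_99": 30}, 60))
```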
Article
Full-text available
Digital video now plays an important role in medical education, health care, telemedicine and other medical applications. Several content-based video retrieval (CBVR) systems have been proposed in the past, but they still suffer from the following challenging problems: the semantic gap, semantic video concept modeling, semantic video classification, and concept-oriented video database indexing and access. In this paper, we propose a novel framework that makes advances toward solving these problems. Specifically, the framework includes: 1) a semantic-sensitive video content representation framework that uses principal video shots to enhance the quality of features; 2) semantic video concept interpretation using a flexible mixture model to bridge the semantic gap; 3) a novel semantic video-classifier training framework that integrates feature selection, parameter estimation, and model selection seamlessly in a single algorithm; and 4) a concept-oriented video database organization technique based on a domain-dependent concept hierarchy to enable semantic-sensitive video retrieval and browsing.
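As a hedged illustration of point 2, one can score a shot against per-concept mixture models and pick the most likely concept. The sketch uses scikit-learn Gaussian mixtures as a stand-in for the paper's flexible mixture model, with synthetic features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy shot-level features for two semantic concepts (e.g. "surgery", "lecture")
rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, (100, 8))
X_b = rng.normal(3.0, 1.0, (100, 8))

# one mixture model per concept (stand-in for the flexible mixture model)
models = {c: GaussianMixture(n_components=2, random_state=0).fit(X)
          for c, X in {"surgery": X_a, "lecture": X_b}.items()}

def classify(shot_feat):
    """Assign the concept whose mixture gives the highest log-likelihood."""
    return max(models, key=lambda c: models[c].score(shot_feat[None, :]))

print(classify(rng.normal(3.0, 1.0, 8)))  # -> "lecture"
```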
Article
Full-text available
In this paper, we propose an optimal key-frame representation scheme based on global statistics for video shot retrieval. Each pixel in this optimal key frame is constructed by considering the probability of occurrence of the pixel values at the corresponding pixel position among the frames in a video shot. This constructed key frame is therefore called the temporally maximum occurrence frame (TMOF), and it is an optimal representation of all the frames in a video shot. The retrieval performance of this representation scheme is further improved by considering, at each pixel position, the k pixel values with the largest probabilities of occurrence and the k highest peaks of the probability distribution of occurrence; the corresponding schemes are called k-TMOF and k-pTMOF, respectively. These key-frame representation schemes are compared with other histogram-based techniques for video shot representation and retrieval. In the experiments, three video sequences from the MPEG-7 content set were used to evaluate the performance of the different key-frame representation schemes. Experimental results show that our proposed representations outperform the alpha-trimmed average histogram for video retrieval.
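The TMOF construction itself is easy to state in code: at every pixel position, keep the value that occurs most often across the frames of the shot. A minimal sketch on synthetic frames follows (k-TMOF would keep the k most frequent values instead of one):

```python
import numpy as np

def tmof(frames):
    """Temporally maximum occurrence frame: at each pixel position, keep the
    value that occurs most often across the frames of a shot."""
    stack = np.stack(frames)                      # (T, H, W)
    out = np.empty(stack.shape[1:], dtype=stack.dtype)
    for i in range(stack.shape[1]):
        for j in range(stack.shape[2]):
            vals, counts = np.unique(stack[:, i, j], return_counts=True)
            out[i, j] = vals[np.argmax(counts)]   # per-pixel mode
    return out

shot = [np.random.randint(0, 4, (4, 4), dtype=np.uint8) for _ in range(10)]
print(tmof(shot))  # the shot's optimal key frame under this scheme
```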
Article
Full-text available
Key frames and previews are two forms of video abstract, widely used for various applications in video browsing and retrieval systems. In this paper we propose a novel method for generating these two abstract forms for an arbitrary video sequence. The underlying principle of the proposed method is the removal of visual-content redundancy among video frames. This is done by first applying multiple partitional clusterings to all frames of a video sequence and then selecting the most suitable clustering option(s) using an unsupervised procedure for cluster-validity analysis. In the last step, key frames are selected as the centroids of the obtained optimal clusters. The video shots to which the key frames belong are concatenated to form the preview sequence.
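A minimal sketch of this pipeline: cluster per-frame features for several cluster counts, pick a clustering with an unsupervised validity score (silhouette is used below as a stand-in for the paper's cluster-validity analysis), and return the frame nearest each centroid as a key frame.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def key_frames(features, k_range=range(2, 8)):
    """Cluster frame features, select k by an unsupervised validity score,
    and return one frame per cluster: the frame closest to its centroid."""
    best = max(
        (KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
         for k in k_range),
        key=lambda km: silhouette_score(features, km.labels_),
    )
    keys = []
    for c in range(best.n_clusters):
        idx = np.where(best.labels_ == c)[0]
        d = np.linalg.norm(features[idx] - best.cluster_centers_[c], axis=1)
        keys.append(idx[np.argmin(d)])
    return sorted(keys)

feats = np.random.rand(120, 32)        # e.g. one colour histogram per frame
print(key_frames(feats))               # indices of selected key frames
```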
Article
Many internet search engines have been developed; however, the retrieval of video clips remains a challenge. This paper considers the retrieval of incident videos, which may contain rich spatial and temporal semantics. We propose an encoder-decoder ConvLSTM model that explores multiple embeddings of a video to facilitate comparison of the similarity between a pair of videos. The model encodes a video into an embedding that integrates both its spatial information and its temporal semantics. Multiple video embeddings are then generated from coarse- and fine-grained features of a video to capture high- and low-level meanings. Subsequently, a learning-based comparative model is proposed to compare the similarity of two videos based on their embeddings. Extensive evaluations show that our model outperforms state-of-the-art methods on several video retrieval tasks on the FIVR-200K, CC_WEB_VIDEO, and EVVE datasets.
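A toy version of the encode-then-compare design might look as follows, with an ordinary per-frame CNN plus LSTM standing in for the ConvLSTM encoder-decoder and a small MLP as the learned comparator. Shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Toy stand-in for the encoder: per-frame CNN features fed to an LSTM;
    the last hidden state is the video embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 8, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.lstm = nn.LSTM(8 * 16, dim, batch_first=True)

    def forward(self, video):                    # (B, T, 3, H, W)
        b, t = video.shape[:2]
        f = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(f)
        return h[-1]                             # (B, dim)

class Comparator(nn.Module):
    """Learned similarity head over a pair of video embeddings."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, e1, e2):
        return torch.sigmoid(self.mlp(torch.cat([e1, e2], dim=-1)))

enc, comp = VideoEncoder(), Comparator()
v1, v2 = torch.randn(1, 8, 3, 32, 32), torch.randn(1, 8, 3, 32, 32)
print(comp(enc(v1), enc(v2)).item())  # similarity score in (0, 1)
```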
Article
Various research efforts have addressed video abstraction alongside the constant development of multimedia technology. However, some deficiencies are encountered in the pre-processing of video frames before classified video archives can be obtained. To overcome these drawbacks, feature extraction and classification approaches are considered. Here, video indexing is performed by extracting several features and generating dominant frames for the input video frames. A fuzzy-based SVM classifier is utilized to categorize the frame set into dominant structures. Multi-dimensional Histogram of Oriented Gradients (HOG) and colour feature extraction are used to derive texture features from the video frames. From the frame sequence, the vector space of structures is captured, and the dominant frames are utilized for video indexing. Shot transitions are classified with a fuzzy system. Experimental outcomes demonstrate that shot boundary detection accuracy increases with the number of iterations. The simulation was carried out in the MATLAB environment. The technique attains an accuracy of about 95.4%, with a precision, recall, and F1 score of 100%; the misclassification rate is 4.6%. The proposed method shows a better trade-off than existing techniques.
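A hedged sketch of the feature-plus-classifier stage: HOG texture features and a coarse colour histogram per frame, fed to an SVM whose probability outputs stand in for fuzzy memberships. The parameters and toy labels are assumptions, not the paper's configuration.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def frame_features(frame_gray, frame_rgb):
    """HOG texture features plus a coarse colour histogram per frame."""
    h = hog(frame_gray, orientations=8, pixels_per_cell=(16, 16),
            cells_per_block=(1, 1))
    c, _ = np.histogram(frame_rgb, bins=8, range=(0, 1))
    return np.concatenate([h, c / c.sum()])

rng = np.random.default_rng(0)
frames = [rng.random((64, 64)) for _ in range(40)]
X = np.array([frame_features(f, np.stack([f] * 3, -1)) for f in frames])
y = rng.integers(0, 2, 40)        # toy labels: dominant vs. non-dominant frame

clf = SVC(probability=True).fit(X, y)   # probability outputs stand in for
memberships = clf.predict_proba(X)      # the fuzzy memberships in the paper
print(memberships[:3])
```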
Article
This paper presents a quantitative evaluation of the dynamics-based de-blurring method using optical character recognition (OCR) technology. Although various image de-blurring algorithms have been studied, there has been no standard performance metric, and de-blurred images have often been evaluated in a qualitative manner. In this study, blurry images containing alphanumeric characters were obtained in the course of rapid motion using a robotic vision system. The blurry images were then recovered by the dynamics-based de-blurring method. For a quantitative evaluation, OCR rates for the images de-blurred by the dynamics-based method were calculated and compared with those obtained by other well-known methods. Experimental results show that the dynamics-based method achieves the best quantitative results.
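The evaluation metric itself is straightforward to reproduce: an OCR rate derived from the Levenshtein distance between the recognized string and the ground truth. A minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_rate(recognized, ground_truth):
    """Character recognition rate used as a quantitative de-blurring metric."""
    return 1.0 - edit_distance(recognized, ground_truth) / len(ground_truth)

print(ocr_rate("AB0DE12", "ABCDE123"))  # 0.75: one substitution, one deletion
```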
Article
In the last decade, e-lecturing has become more and more popular, and the amount of lecture video data on the World Wide Web (WWW) is growing rapidly. Therefore, a more efficient method for video retrieval on the WWW or within large lecture video archives is urgently needed. This paper presents an approach for automated video indexing and video search in large lecture video archives. First, we apply automatic video segmentation and key-frame detection to offer a visual guideline for navigating the video content. Subsequently, we extract textual metadata by applying video Optical Character Recognition (OCR) technology to key-frames and Automatic Speech Recognition (ASR) to lecture audio tracks. The OCR and ASR transcripts, as well as the detected slide text line types, are used for keyword extraction, by which both video- and segment-level keywords are extracted for content-based video browsing and search. The performance and effectiveness of the proposed indexing functionalities are demonstrated by evaluation.
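One plausible reading of the segment-level keyword extraction is sketched below: words appearing in the slide OCR are weighted above words only present in the ASR transcript. The weighting scheme and stop-word list are assumptions for illustration, not the paper's method.

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "we", "this", "so", "how"}

def keywords(ocr_text, asr_text, top_n=5, slide_boost=2.0):
    """Segment-level keyword extraction: words from the slide OCR are
    weighted higher than words only heard in the ASR transcript."""
    def toks(s):
        return [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP]
    scores = Counter()
    for w in toks(asr_text):
        scores[w] += 1.0
    for w in toks(ocr_text):
        scores[w] += slide_boost
    return [w for w, _ in scores.most_common(top_n)]

ocr = "Gradient Descent: update rule, learning rate"
asr = "so the learning rate controls how big a step gradient descent takes"
print(keywords(ocr, asr))  # e.g. ['gradient', 'descent', 'learning', 'rate', ...]
```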
Article
Among possible research areas in multimedia, keyframe extraction is an important topic that enables video summarization and faster browsing and access of large video collections. In this paper, we propose a new automatic shot-based keyframe extraction method for video indexing and retrieval applications. Initially, the frames are sequentially clustered into shots by using the feature extraction and continuity-value construction steps of the shot boundary detection process, together with the shot-frame clustering technique. The cluster with the larger dispersion rate is selected for inter-cluster similarity analysis (ICSA), and sub-shot-based keyframes are extracted using ICSA. The proposed shot boundary detection algorithm and keyframe extraction technique are implemented and evaluated on publicly available ecological video datasets. Compared with existing related algorithms, our method yields a better F1-score of 94.2% for shot boundary detection and better results for keyframe extraction. The keyframes extracted by the proposed method are used for video indexing and retrieval.
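The shot boundary detection step can be illustrated with a histogram-intersection "continuity value" between consecutive frames, declaring a cut when continuity drops below a threshold. The bin count and threshold below are illustrative assumptions.

```python
import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.5):
    """Detect shot boundaries from the histogram-intersection continuity
    value between consecutive frames (1 = identical, 0 = disjoint)."""
    hists = [np.histogram(f, bins=bins, range=(0, 1))[0] / f.size
             for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        continuity = np.minimum(hists[i - 1], hists[i]).sum()
        if continuity < threshold:
            cuts.append(i)
    return cuts

rng = np.random.default_rng(1)
shot_a = [np.clip(rng.normal(0.3, 0.05, (32, 32)), 0, 1) for _ in range(20)]
shot_b = [np.clip(rng.normal(0.8, 0.05, (32, 32)), 0, 1) for _ in range(20)]
print(shot_boundaries(shot_a + shot_b))  # -> [20]
```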
Article
In this paper, we tackle the problem of matching objects in video within the framework of the rough indexing paradigm. In this context, the video data are of very low spatial and temporal resolution because they come from partially decoded MPEG compressed streams. This paradigm enables us to achieve our purpose in near real time, due to the faster computation on rough data than on video frames at the original full spatial and temporal resolution. In this context, segmentation of rough video frames is inaccurate and the region features (texture, color, shape) are not strongly relevant. The structure of the objects must be considered in order to improve the robustness of region matching. The problem of object matching can be expressed in terms of region adjacency graph (RAG) matching. Here, we propose a directed acyclic graph (DAG) matching method based on a heuristic in order to approximate object matching. The RAGs to compare are first transformed into DAGs by orienting their edges. Then, we compute combinatoric metrics on nodes in order to classify them by similarity. Finally, a top-down process on the DAGs matches similar patterns that exist between the two DAGs. The results are compared with those of a method based on relaxation matching.
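The RAG-to-DAG step can be sketched with networkx by orienting each edge along a strict total order on nodes (degree, then node id, below), which guarantees acyclicity. The actual orientation heuristic and combinatoric metrics of the paper may differ.

```python
import networkx as nx

def rag_to_dag(rag):
    """Orient each RAG edge from the node of higher (degree, id) to the
    lower; a strict total order makes the result acyclic by construction."""
    dag = nx.DiGraph()
    dag.add_nodes_from(rag.nodes(data=True))
    for u, v in rag.edges():
        key = lambda n: (rag.degree[n], n)
        a, b = (u, v) if key(u) > key(v) else (v, u)
        dag.add_edge(a, b)
    return dag

rag = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])   # regions of one object
dag = rag_to_dag(rag)
print(nx.is_directed_acyclic_graph(dag), list(dag.edges()))
```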
Article
With the rapid proliferation of multimedia applications that require video data management, it is becoming more desirable to provide proper video data indexing techniques capable of representing the rich semantics in video data. In real-time applications, the need for efficient query processing is another reason for the use of such techniques. We present models that use object motion information to characterize events and allow subsequent retrieval. Algorithms for different spatiotemporal search cases, in terms of spatial and temporal translation and scale invariance, have been developed using various signal and image processing techniques. We have developed a prototype video search engine, PICTURESQUE (pictorial information and content transformation unified retrieval engine for spatiotemporal queries), to verify the proposed methods. Development of such technology will enable true multimedia search engines capable of indexing and searching digital video data based on its true content.
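Translation and scale invariance for trajectory matching can be obtained by a simple normalization, sketched below: subtract the centroid and divide by the RMS radius. This is a generic technique, not necessarily the signal-processing scheme used in PICTURESQUE.

```python
import numpy as np

def normalize_trajectory(traj):
    """Make an (n, 2) motion trajectory invariant to spatial translation
    and scale: subtract the centroid, then divide by the RMS radius."""
    centered = traj - traj.mean(axis=0)
    scale = np.sqrt((centered ** 2).sum(axis=1).mean())
    return centered / scale if scale else centered

a = np.column_stack([np.arange(10.0), np.arange(10.0) ** 1.5])
b = 3.0 * a + np.array([50.0, -20.0])      # same motion, shifted and scaled
print(np.allclose(normalize_trajectory(a), normalize_trajectory(b)))  # True
```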
Article
The characterization of a video segment by a digital signature is a fundamental task in video processing. It is necessary for video indexing and retrieval, copyright protection, and other tasks. Semantic video signatures are those that are based on high-level content information rather than on low-level features of the video stream. The major advantage of such signatures is that they are highly invariant to nearly all types of distortion. A major semantic feature of a video is the appearance of specific persons in specific video frames. Because of the great amount of research that has been performed on the subject of face detection and recognition, the extraction of such information is generally tractable, or will be in the near future. We have developed a method that uses the pre-extracted output of face detection and recognition to perform fast semantic query-by-example retrieval of video segments. We also give the results of the experimental evaluation of our method on a database of real video. One advantage of our approach is that the evaluation of similarity is convolution-based, and is thus resistant to perturbations in the signature and independent of the exact boundaries of the query segment.
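A minimal sketch of convolution-style matching over face-appearance signatures: slide the query's person-by-frame matrix over the video's and score each offset by correlation, so small perturbations and inexact segment boundaries degrade the score gracefully rather than breaking the match. The binary signatures below are toy data, not the authors' representation.

```python
import numpy as np

# signature: rows = known persons, cols = frames; 1 if the face appears
video_sig = np.array([[0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0],
                      [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]])
query_sig = np.array([[1, 1, 1, 0],
                      [0, 0, 0, 1]])

def best_match(video, query):
    """Correlate the query signature against the video signature at every
    temporal offset and return the best-matching position."""
    w = query.shape[1]
    scores = [np.sum(video[:, t:t + w] * query)
              for t in range(video.shape[1] - w + 1)]
    return int(np.argmax(scores)), scores

offset, scores = best_match(video_sig, query_sig)
print(offset, scores)
```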
Article
We propose a new method for indexing an image for fast content-based image browsing and retrieval in a database using a rosette pattern. By applying a rosette pattern, which has more sample lines near the center than in the outer parts, we obtain global gray-level distribution features as well as local positional information. These features are transformed into a histogram and used as database indices. Our method reduces the number of pixels required to index an image compared with conventional approaches, while providing good retrieval performance.
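A rose-curve sampler gives the flavour of the rosette index: sample positions fall densely near the image centre, and a histogram over only those pixels serves as the index. The exact rosette geometry here is an assumption for illustration, not the paper's pattern.

```python
import numpy as np

def rosette_samples(h, w, petals=8, n=400):
    """Sample positions along a rose curve r = |cos(petals * t)|: points fall
    densely near the image centre and sparsely toward the border."""
    t = np.linspace(0, 2 * np.pi, n)
    r = np.abs(np.cos(petals * t))
    ys = (h / 2 + r * (h / 2 - 1) * np.sin(t)).astype(int)
    xs = (w / 2 + r * (w / 2 - 1) * np.cos(t)).astype(int)
    return ys, xs

def rosette_index(image, bins=16):
    """Histogram of grey values sampled on the rosette: a compact index that
    uses far fewer pixels than a full-image histogram."""
    ys, xs = rosette_samples(*image.shape)
    hist, _ = np.histogram(image[ys, xs], bins=bins, range=(0, 256))
    return hist / hist.sum()

img = np.random.randint(0, 256, (128, 128))
print(rosette_index(img))
```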