Article

# Keyframe-based video summarization using Delaunay clustering


## Abstract

Recent advances in technology have made tremendous amounts of multimedia information available to the general population. An efficient way of dealing with this development is to build browsing tools that distill multimedia data into information-oriented summaries. Such an approach not only suits resource-poor environments such as wireless and mobile, but also enhances browsing on the wired side for applications like digital libraries and repositories. Automatic summarization and indexing techniques give users an opportunity to browse and select multimedia documents of their choice for complete viewing later. In this paper, we present a technique for automatically gathering the frames of interest in a video for the purpose of summarization. Our proposed technique is based on using Delaunay Triangulation for clustering the frames in videos: we represent the frame contents as multi-dimensional point data and use Delaunay Triangulation to cluster them. The proposed summarization technique, built on Delaunay clusters, generates good-quality summaries with fewer frames and less redundancy than other schemes. In contrast to many other clustering techniques, the Delaunay clustering algorithm is fully automatic, with no user-specified parameters, and is well suited for batch processing. We demonstrate these and other desirable properties of the proposed algorithm by testing it on a collection of videos from the Open Video Project. We provide a meaningful comparison of the results of the proposed summarization technique with the Open Video storyboard and K-means clustering, and evaluate the results in terms of metrics that measure the content-representational value of the proposed technique.
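The clustering step described above can be sketched in a few lines: build the Delaunay triangulation of the frame points, cut unusually long ("separating") edges, and take the connected components of what remains as clusters. This is a minimal illustration of the general idea, not the authors' exact implementation; in particular, the mean-plus-one-standard-deviation edge cutoff is an assumption of this sketch.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def delaunay_clusters(points):
    """Cluster points by triangulating them, cutting unusually long
    edges, and taking connected components of the remaining graph."""
    tri = Delaunay(points)
    # Collect the unique undirected edges of the triangulation.
    edges = set()
    for simplex in tri.simplices:
        for i in range(len(simplex)):
            for j in range(i + 1, len(simplex)):
                a, b = simplex[i], simplex[j]
                edges.add((min(a, b), max(a, b)))
    edges = np.array(sorted(edges))
    lengths = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]], axis=1)
    # Cut edges longer than mean + one standard deviation (a simple
    # stand-in for the paper's inter-cluster edge criterion).
    kept = edges[lengths <= lengths.mean() + lengths.std()]
    n = len(points)
    adj = csr_matrix((np.ones(len(kept)), (kept[:, 0], kept[:, 1])), shape=(n, n))
    n_clusters, labels = connected_components(adj, directed=False)
    return n_clusters, labels

# Two well-separated blobs of points should yield two clusters.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(100, 0.5, (30, 2))])
k, labels = delaunay_clusters(pts)
```

The appeal of this scheme, as the abstract notes, is that no cluster count or threshold needs to be supplied by the user: the edge-length statistics come from the triangulation itself.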


... Lai et al. [8] and Hoi et al. [9] used a visual attention model for KFE. Mundur et al. [10] proposed a method based on Delaunay Triangulation to cluster the key frames [11][12][13]. The presented scheme yields better results than some of the well-known non-visual-attention-based schemes. ...
... The total duration of the dataset is 75 minutes, where each video lasts 1 to 4 minutes. Our methodology is compared with five techniques (OV [12], DT [10], STIMO [31], VSCAN [33], and VSUMM [11]) using the standard evaluation metrics of F-measure, precision, and recall on this dataset. ...
... Tables 5 and 6 show the MOS scores of our method and the alternative techniques ([7], [13], [8], [12], [10], [31], [11], [36], [37]) for twenty videos, with an average score in the final row. ...
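The precision/recall/F-measure evaluation mentioned in the excerpts above can be sketched as follows. The one-to-one matching of predicted keyframes to ground-truth keyframes within a frame-tolerance window is a simplified stand-in for the VSUMM-style comparison protocol, and the tolerance value is an assumption of this sketch.

```python
def keyframe_prf(predicted, ground_truth, tol=15):
    """Precision, recall, and F-measure for keyframe sets: each predicted
    frame index may match at most one unmatched ground-truth index that
    lies within `tol` frames of it."""
    matched = set()
    hits = 0
    for p in sorted(predicted):
        for g in sorted(ground_truth):
            if g not in matched and abs(p - g) <= tol:
                matched.add(g)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

For example, predicting frames {10, 100, 500} against ground truth {12, 98} matches two of three predictions, giving precision 2/3, recall 1.0, and F-measure 0.8.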
Article
Full-text available
Video summarization is applied to reduce redundancy and develop a concise representation of the key frames in a video. More recently, video summaries have been generated through visual attention modeling: the frames that stand out visually are extracted as key frames based on theories of human attention. Schemes for modeling visual attention have proven effective for video summarization; nevertheless, their high computational cost restricts their usability in everyday situations. In this context, we propose a KFE (key frame extraction) method built on an efficient and accurate visual attention model. The computational effort is minimized by utilizing dynamic visual saliency based on the temporal gradient instead of traditional optical flow techniques. In addition, an efficient technique using the discrete cosine transform is utilized for static visual salience. The dynamic and static visual attention metrics are merged by means of a non-linear weighted fusion technique. Results of the system are compared with existing state-of-the-art techniques to assess accuracy. The experimental results indicate that our proposed model is efficient and produces high-quality key frames as output.
... Keyframe selection for short videos is well studied in the literature [3,4]. One main category of algorithms is clustering-based [5,6,7,8,9]: grouping frames into clusters, then selecting a representative frame from each cluster. While some methods have low complexity [7,6], they are heuristic-based when setting the number of clusters and selecting keyframes. ...
... VSUMM [7] is the primary dataset for keyframe-based video summarization. Collected from the Open Video Project (OVP) [5], it consists of 50 videos of 1-4 minutes' duration at 352×240 resolution in MPEG-1 format (mostly at 30 fps). The videos span different genres: documentary, educational, lecture, historical, etc. Fifty human subjects participated in constructing user summaries, each annotating five videos. ...
... For comparison, we follow [7,10] and compare our method with DT [5], VSUMM [7], STIMO [6], MSR [12], AGDS [20] and SBOMP [10]. The results for the DT, STIMO and VSUMM methods are available from the official VSUMM website. ...
Preprint
Full-text available
We study the problem of efficiently summarizing a short video into several keyframes, leveraging recent progress in fast graph sampling. Specifically, we first construct a similarity path graph (SPG) $\mathcal{G}$, represented by graph Laplacian matrix $\mathbf{L}$, where the similarities between adjacent frames are encoded as positive edge weights. We show that maximizing the smallest eigenvalue $\lambda_{\min}(\mathbf{B})$ of a coefficient matrix $\mathbf{B} = \text{diag}(\mathbf{a}) + \mu \mathbf{L}$, where $\mathbf{a}$ is the binary keyframe selection vector, is equivalent to minimizing a worst-case signal reconstruction error. We prove that, after partitioning $\mathcal{G}$ into $Q$ sub-graphs $\{\mathcal{G}^q\}^Q_{q=1}$, the smallest Gershgorin circle theorem (GCT) lower bound of $Q$ corresponding coefficient matrices -- $\min_q \lambda^-_{\min}(\mathbf{B}^q)$ -- is a lower bound for $\lambda_{\min}(\mathbf{B})$. This inspires a fast graph sampling algorithm to iteratively partition $\mathcal{G}$ into $Q$ sub-graphs using $Q$ samples (keyframes), while maximizing $\lambda^-_{\min}(\mathbf{B}^q)$ for each sub-graph $\mathcal{G}^q$. Experimental results show that our algorithm achieves comparable video summarization performance as state-of-the-art methods, at a substantially reduced complexity.
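The similarity path graph and the coefficient matrix $\mathbf{B}$ from the abstract above can be sketched numerically as follows. The Gaussian edge weight and the bandwidth `sigma` are assumptions of this illustration, not necessarily the authors' exact choices; the point is only to show the objects involved.

```python
import numpy as np

def spg_laplacian(features, sigma=1.0):
    """Laplacian of a similarity path graph (SPG): node i connects only
    to node i+1, with a Gaussian-kernel weight on the feature distance."""
    n = len(features)
    W = np.zeros((n, n))
    for i in range(n - 1):
        d2 = np.sum((features[i] - features[i + 1]) ** 2)
        W[i, i + 1] = W[i + 1, i] = np.exp(-d2 / (2 * sigma ** 2))
    D = np.diag(W.sum(axis=1))
    return D - W  # combinatorial graph Laplacian, positive semi-definite

def lambda_min_B(L, a, mu=0.1):
    """Smallest eigenvalue of B = diag(a) + mu * L, where a is the binary
    keyframe-selection vector the abstract maximizes over."""
    B = np.diag(a) + mu * L
    return np.linalg.eigvalsh(B)[0]
```

With no keyframes selected (`a = 0`), `B` reduces to `mu * L` and its smallest eigenvalue is zero, reflecting an unbounded worst-case reconstruction error; selecting frames lifts the lower bound, which is what the sampling algorithm exploits.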
... The main objective of any clustering-based video summarization algorithm is to reduce the redundancy of a video by selecting only the informative frames. Many clustering algorithms [29][40] have been used for this purpose. The main demerit of the existing clustering-based approaches is that they generally need a predefined number of clusters. ...
... After termination of our proposed algorithm based on this overlapping index, we determine non-overlapping clusters based on the fuzzy belongingness of the objects to the clusters, i.e., the frame closest to a cluster centroid is placed in that cluster. The method is compared with some state-of-the-art clustering algorithms [29][40] with respect to different cluster validation indices. In the literature, there are many internal [27], external [11][22][12], and stability-based [3] cluster validity indices. ...
... After obtaining the disjoint set of clusters, they are validated and compared with different clustering algorithms based on cluster validation indices, namely internal indices and stability indices. In order to evaluate the effectiveness of the proposed CAFKF clustering algorithm, we compare it with several popular and frequently used clustering algorithms, such as K-means clustering (KM) [8], spectral clustering (SC) [9], clustering by affinity propagation (AP) [17], and Delaunay clustering (DC) [29]. Table 1. ...
Article
Full-text available
Video summarization is the process of refining the original video into a more concise form without losing valuable information. Both the efficient storage of a video and the extraction of valuable information from it are challenging tasks in video analysis. Intelligent video surveillance systems play an essential role in ensuring public safety and security. Recent intelligent technologies extensively use surveillance systems in all areas, from border security applications to street monitoring systems. Surveillance cameras and motion-triggered cameras produce large volumes of data when employed for recording videos. As the analysis of videos by humans demands immense manpower, automatic video summarization is an important and growing research topic. Hence, it is necessary to summarize the activities in the scene and eliminate unusual and redundant events recorded in videos. The proposed work develops a video summarization framework using key-moment-based frame selection and clustering of frames to identify only the informative frames. The key moment is a simple yet effective characteristic for summarizing a long video shot, and motion is the most salient feature for presenting actions or events in video, so it is used here to extract the key moments of the video frames. A key moment corresponds to the scene of a video frame in which the motion shows the greatest acceleration and deceleration. Based on the extracted key moments, the frames of the video are partitioned into different groups using a novel similarity-based agglomerative clustering algorithm. The algorithm determines at most K clusters of frames based on the Jaccard similarity among the clusters, where K is a user-defined parameter set to 5% to 15% of the size of the video to be summarized. From each cluster, a few representative frames are identified based on the cluster centroids and arranged according to their original video sequence to generate the summary of the video. The proposed clustering algorithm and the summarization method are evaluated using state-of-the-art video datasets and compared with some related methodologies to demonstrate their effectiveness.
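The Jaccard-based agglomerative merging described above can be sketched as follows. Representing each frame as a set of quantized feature tokens is an assumption of this sketch (the paper's own frame representation may differ), and the only stopping rule used here is the cluster-count cap K.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def agglomerate(frame_sets, k):
    """Greedy agglomerative clustering: repeatedly merge the two clusters
    whose element sets have the highest Jaccard similarity, until at most
    k clusters remain. Returns the frame indices in each cluster."""
    clusters = [(set(s), [i]) for i, s in enumerate(frame_sets)]
    while len(clusters) > k:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][0], clusters[j][0])
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return [members for _, members in clusters]
```

For instance, four frames whose token sets form two overlapping pairs collapse into exactly those two groups when K = 2.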
... Many new techniques have been proposed to improve effectiveness, among which the two most commonly used are clustering-based and shot-segmentation-based methods. Clustering is a popular unsupervised technique for keyframe selection [4]-[6]: it groups similar frames and selects the frames closest to the cluster centers as keyframes. Most current clustering-based approaches cannot detect noise frames automatically and require the number of clusters as an input. ...
... Clustering-based techniques group video frames with similar content into clusters and choose representative frames from each cluster. Mundur et al. represented each video frame with a color histogram in HSV space, used Delaunay Triangulation to cluster the frames, and chose the frames closest to each cluster centre as keyframes [4]. Furini et al. proposed to produce keyframes by adopting Furthest-Point-First (FPF) clustering in HSV color space [5]. ...
... (ii) Clustering based: DT [4], STIMO [5] and VSUMM [6]. (iii) Shot Segmentation based: FSS [51] and MTSA [52]. ...
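The Furthest-Point-First strategy mentioned above for STIMO-style methods can be sketched as follows. Seeding from the first frame and using Euclidean distance on the feature vectors are assumptions of this illustration.

```python
import numpy as np

def furthest_point_first(features, k):
    """FPF seeding: start from frame 0, then repeatedly pick the frame
    furthest (in Euclidean distance) from all centers chosen so far.
    Returns the indices of the k chosen center frames."""
    centers = [0]
    # dist[i] = distance from frame i to its nearest chosen center.
    dist = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return centers
```

Each new center is guaranteed to be in a region of feature space not yet covered, which is why FPF gives well-spread keyframes at low cost.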
Article
Full-text available
Video summarization (VS) is generally formulated as a subset selection problem in which a set of representative keyframes or key segments is selected from the entire video frame set. Though many sparse subset selection based VS algorithms have been proposed in the past decade, most of them adopt a linear sparse formulation in the explicit feature-vector space of video frames and do not consider the local or global relationships among frames. In this paper, we first extend conventional sparse subset selection for VS to kernel block sparse subset selection (KBS3) to exploit the advantages of kernel sparse coding, and introduce a local inter-frame relationship through the packing of frame blocks. Going a step further, we propose a similarity based block sparse subset selection (SB2S3) model by applying a specially designed transformation matrix to the KBS3 model in order to introduce a kind of global inter-frame relationship through the similarity. Finally, a greedy pursuit based algorithm is devised for the proposed NP-hard model optimization. The proposed SB2S3 has the following advantages: 1) through the similarity between each frame and every other frame, the global relationship among all frames can be considered; 2) through block sparse coding, the local relationship of adjacent frames is further considered; and 3) it has wider applicability, since features can derive similarity, but not vice versa. It is believed that the effect of modeling such global and local relationships among frames is similar to that of modeling the long-range and short-range dependencies among frames in deep learning based methods. Experimental results on three benchmark datasets demonstrate that the proposed approach is superior not only to other sparse subset selection based VS methods but also to most unsupervised deep-learning based VS methods.
... During the dynamic process, a structure may develop large nonlinear deformations, leading to complicated 3D movements. In consequence, the clustering method, one of the most widely used key frame extraction methods and one with the advantage of handling complicated 3D movements (Zhuang et al. 1998; Yang and Lin 2005; Mundur et al. 2006), may be a suitable choice for structural dynamic analyses. Nevertheless, the existing clustering methods (Zhuang et al. 1998; Yang and Lin 2005; Mundur et al. 2006) were not originally designed for GPU-based rendering and cannot be adapted to different GPU platforms, because the size of the extracted key frames may exceed the memory limitation of a GPU. In view of the above, a key frame extraction method specific to GPU-based rendering needs to be developed. ...
Chapter
Full-text available
This chapter proposes a suite of high-fidelity computational models for tall building simulation (covering fiber-beam element model, multi-layer shell element model, and so forth), followed by model validation against published experimental results. Subsequently, graphics processing unit (GPU)-based high performance matrix solvers in conjunction with physics engine-based high-performance visualization techniques are proposed, with which the process of simulation and visualization can be greatly accelerated.
... The extraction of a dynamic video summary is generally divided into the extraction of video key frames [13], the temporal segmentation of video frames [14], and the selection of video sub-shots [15]. For the selection of video sub-shots, Hao et al. [16] encoded the shots from coarse to fine using a tree hierarchy, and then selected the sub-shots along the branches. ...
... According to the HOOI algorithm, the high-dimensional convolutional layer is decomposed to return the core tensor and the factor matrices W^I_r3 and W^O_r4 of the convolution kernels. The results after decomposition are shown in (14). ...
Article
Full-text available
This paper addresses the large number of parameters and the complex computation involved in video summary generation with fully connected and convolutional neural networks. Training and testing such models also require considerable time and computing resources. We propose a deep-learning network parameter compression method based on Singular Value Decomposition (SVD) and Tucker Decomposition (TD) to generate video summaries. The method was compared with others on the TVSum and SumMe datasets, achieving F1 values of 55.3% on TVSum and 46.8% on SumMe. The reduction in test time for the same data volume is taken as an additional evaluation criterion. The experimental results show that the proposed method achieves a 1.04× speedup in the SVD forward computation and a 1.29× speedup in the TD forward computation. In conclusion, a neural network model based on low-rank decomposition can effectively save computing resources and program running time.
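The SVD part of such a compression scheme can be sketched as follows for a single layer's weight matrix. This is a generic truncated-SVD factorization, not the paper's exact pipeline, and the rank is a user-chosen assumption.

```python
import numpy as np

def svd_compress(W, rank):
    """Low-rank factorization of a weight matrix via truncated SVD:
    an m x n matrix W is replaced by A (m x r) and B (r x n), cutting
    the parameter count from m*n down to r*(m + n) when r is small."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into the left factor
    B = Vt[:rank]
    return A, B
```

At inference time the dense product `x @ W` becomes the cheaper two-step product `(x @ A) @ B`, which is the source of the forward-pass speedup the abstract reports.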
... The clustering algorithms cluster video frames and then select the centroids of each cluster to generate the final summary. The k-means clustering method [28], a graph-based technique called "modularity" [29], Delaunay Triangulation [30], Farthest Point-First (FPF) algorithm [31], and Density-Based Spatial Clustering (DBSCAN) [32] are all used in video summarization. ...
... We compare the proposed method against 5 different video summarization methods which also employ the OVP dataset. These methods are Delaunay Triangulation (DT) [30], Still and Moving Video Storyboards (STIMO) [31], VSUMM1 [5], VSUMM2 [5], and the OVP summaries. The DT algorithm clusters the video frames using the Delaunay Triangulation method and picks each cluster's centroid to generate the summaries. ...
... When it comes to the OV dataset, several methods over the years have based their evaluation on this benchmark. Specifically, in [42] a VS method was proposed via Delaunay Triangulation clustering (DT): each frame was represented by an HSV color histogram, each histogram was treated as a row vector, the vectors for all frames were stacked into a matrix, and principal component analysis (PCA) was used for dimension reduction. The Delaunay diagram was then built, and clusters were created based on the edges of this diagram. ...
... Every video is watched and evaluated by 5 humans, who selected ground-truth (GT) frames according to their subjective opinions. Since this dataset is publicly available and consists of videos along with their respective summarization GT, it is quite popular, and several VS methods select it for their experiments [58], [12], [42]-[63]. ...
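The histogram-matrix-plus-PCA step described in the excerpt above can be sketched as follows; computing the principal components via an SVD of the centered data matrix is a standard equivalent formulation, and stacking one histogram per row is the layout the excerpt describes.

```python
import numpy as np

def pca_reduce(X, dim):
    """Project the rows of X (one feature histogram per frame) onto the
    top `dim` principal components, via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)          # center each feature dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T           # coordinates in the top components
```

Because the components are ordered by singular value, the first output column carries at least as much variance as the second, and so on, which is what makes the truncation safe for subsequent clustering.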
Preprint
Full-text available
The huge amount of video data produced daily by camera-based systems, such as surveillance, medical, and telecommunication systems, creates the need for effective video summarization (VS) methods capable of providing an overview of the video content. In this paper, we propose a novel VS method based on a Generative Adversarial Network (GAN) model pre-trained with human eye fixations. The main contribution of the proposed method is that it can provide perceptually compatible video summaries by combining both perceived color and spatiotemporal visual attention cues in an unsupervised scheme. Several fusion approaches are considered for robustness under uncertainty and for personalization. The proposed method is evaluated against state-of-the-art VS approaches on the benchmark dataset VSUMM. The experimental results show that SalSum outperforms the state-of-the-art approaches, providing the highest F-measure score on the VSUMM benchmark.
... Unsupervised methods generally choose key shots in terms of heuristic criteria, such as relevance, representativeness, and diversity. Among them, cluster-based approaches [16] aggregate visually similar shots into the same group, and the obtained group centers are selected as the final summary. In earlier works, clustering algorithms were directly applied to video summarization [17]; later, some researchers combined domain knowledge to improve performance [16], [4]. Besides, dictionary learning [1] is another stream of unsupervised methods: it finds key shots to build a dictionary as the representation of the video, while preserving the local structure of the data when necessary. ...
Preprint
Video summarization is an effective way to facilitate video searching and browsing. Most existing systems employ encoder-decoder based recurrent neural networks, which fail to explicitly diversify the system-generated summary frames while requiring intensive computations. In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention, called SUM-GDA, which adapts the attention mechanism in a global perspective to consider pairwise temporal relations of video frames. The GDA module has two advantages: 1) it models the relations within paired frames as well as the relations among all pairs, thus capturing the global attention across all frames of one video; 2) it reflects the importance of each frame to the whole video, leading to diverse attention on these frames. Thus, SUM-GDA is beneficial for generating diverse frames to form a satisfactory video summary. Extensive experiments on three datasets, i.e., SumMe, TVSum, and VTW, have demonstrated that SUM-GDA and its extension outperform other competing state-of-the-art methods with remarkable improvements. In addition, the proposed models can be run in parallel with significantly less computational cost, which helps deployment in highly demanding applications.
...

| Method | SumMe | TVSum |
| --- | --- | --- |
| k-medoids [33] | 0.334 | 0.288 |
| Delaunay [52] | 0.315 | 0.394 |
| VSUMM [32] | 0.335 | 0.391 |
| SALF [34] | 0.378 | 0.420 |
| LiveLight [53] | 0.384 | 0.477 |
| CSUV [29] | 0.393 | 0.532 |
| LSMO [54] | 0.403 | 0.568 |
| Summary Transfer [55] | 0.409 | - |
| AVRN | 0.441 | 0.597 |

... by the current key-shot set. They achieve better performance than clustering-based approaches. ...
Preprint
Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently, as it can boost the performance of various computer vision tasks. In video summarization, however, existing approaches exploit only the visual information and neglect the audio information. In this paper, we argue that the audio modality can assist the vision modality in better understanding the video content and structure, and can further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream LSTM encodes the audio and visual features sequentially by capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Practically, the experimental results on the two benchmarks, i.e., SumMe and TVsum, have demonstrated the effectiveness of each part and the superiority of AVRN over approaches exploiting only visual information for video summarization.
... Unsupervised techniques do not require any complementary annotations. They can be trained on larger amounts of data, which often improves the reliability and generality of the model [17,18,[35][36][37][38][39][40]. Although these works have shown good results in video summarization and highlight generation, they rely on features from both audio and video. ...
Article
Full-text available
In this paper, we present HOMER, a cloud-based system for video highlight generation which enables the automated, relevant, and flexible segmentation of videos. Our system outperforms state-of-the-art solutions by fusing internal video content-based features with the user's emotion data. While current research mainly focuses on creating video summaries without the use of affective data, our solution achieves the subjective task of detecting highlights by leveraging human emotions. In two separate experiments, including videos filmed with a dual-camera setup and home videos randomly picked from Microsoft's Video Titles in the Wild (VTW) dataset, HOMER achieves an improvement of up to 38% in F1-score over the baseline, while not requiring any external hardware. We demonstrated both the portability and scalability of HOMER through the implementation of two smartphone applications.
... There are two forms of summarization output. The first uses selected key frames [10,27,34] as the output. ...
Article
Full-text available
Surveillance videos recording crowd behaviors have increased dramatically due to their wide application. A quick view of such crowd surveillance video within a constrained time is in increasing demand, because these videos always contain a huge number of redundant frames. In this paper, we focus on the summarization of crowd surveillance videos. This is not easy, for two reasons. First, a decision must be made to keep or discard each subshot from the input surveillance video stream so that the summary outlines the main behaviors of the crowd over a limited frame sequence. Second, the performance of the summarization model must be maintained for long surveillance videos. To tackle these challenges, we formulate surveillance video summarization as a sequential decision-making process and train the summarization network with a reinforcement learning-based framework. A novel crowd location-density reward is proposed to teach the summarization network to produce high-quality summaries. In addition, a summarization network with a three-layer LSTM is designed to maintain performance across longer time spans. Extensive experiments on three public crowd surveillance video datasets show that the proposed method achieves state-of-the-art performance.
... For example, the sets of points closest in terms of Euclidean distance to the points taken in a Euclidean space form a Voronoi diagram of this space. Moreover, seeding points and triangles emerging from the adjacency relation of Voronoi cells constitute the Delaunay triangulation of the underlying metric space [6][7][8][9]. Network models, represented by mathematically simple directed/undirected graphs, together with the graph distance, form a discrete metric space. In determining Voronoi diagrams for this type of metric space, sets of points with minimum graph distances to seeding points are taken into account [10][11][12]. ...
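A graph Voronoi partition of the kind described above can be sketched with a multi-source breadth-first search over an unweighted graph: each node is assigned to the seed with minimum graph (hop) distance. The tie-breaking rule used here (the seed whose BFS front reaches the node first) is an assumption of this sketch.

```python
from collections import deque

def graph_voronoi(adj, seeds):
    """Voronoi partition of an unweighted graph given as an adjacency
    dict {node: [neighbors]}: assign every reachable node to its nearest
    seed by hop distance, via multi-source BFS."""
    label = {s: s for s in seeds}   # each seed labels its own region
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in label:      # first arrival wins (shortest hops)
                label[v] = label[u]
                queue.append(v)
    return label
```

On a path graph 0-1-2-3-4 with seeds {0, 4}, node 1 falls in the region of seed 0, node 3 in the region of seed 4, and the midpoint 2 is resolved by the tie-breaking rule.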
Article
Full-text available
Community structure detection is an important and valuable task in financial network studies, as it forms the basis of many statistical applications such as prediction, risk analysis, and recommendation. Financial networks have a natural multi-grained structure that leads to different community structures at different levels. However, few studies pay attention to these multi-grained features of financial networks. In this study, we present a geometric coarse-graining method based on the Voronoi regions of a financial network. Rather than studying the dense structure of the network, we perform our analysis on the triangulated maximally filtered graph of a financial network. Such a filtered topology emerges as an efficient approach because it keeps local clustering coefficients steady and underlies the network geometry. Moreover, in order to capture changes in coarse-grain geometry throughout a financial stress period, we study the Haantjes curvatures of the paths that are farthest from the center in each of the Voronoi regions. We performed our analysis on a network representation comprising the stock market indices BIST (Borsa Istanbul), FTSE100 (London Stock Exchange), and the Nasdaq-100 Index (NASDAQ), across three financial crisis periods. Our results indicate that there are remarkable changes in the geometry of coarse grains.
... In clustering algorithms [2,6,14,17,21,29], the frames of the video are clustered according to feature similarity using some clustering procedure, and the cluster center is chosen as the keyframe. Kumar et al. [17] employed the k-means clustering procedure to extract keyframes. ...
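A minimal sketch of this cluster-then-pick-centroid scheme, using synthetic per-frame feature vectors and a hypothetical cluster count k (illustrative, not the cited implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def keyframes_by_kmeans(features, k=3, seed=0):
    """Cluster per-frame feature vectors and return, for each cluster,
    the index of the frame closest to the cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    keyframes = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)

# Synthetic "video": 30 frames drawn from three distinct feature regimes.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(loc=m, scale=0.1, size=(10, 8)) for m in (0.0, 1.0, 2.0)])
print(keyframes_by_kmeans(feats, k=3))
```

The Delaunay-based method of the paper under discussion differs precisely in that it needs no user-specified k.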
Article
Full-text available
Several computer vision applications, such as e-learning, video editing, video compression, video-on-demand, and surveillance, are popular these days. Most of these applications need videos to be retrieved and processed regularly. The first and foremost step towards video retrieval and management is keyframe extraction, and accurate identification of shot transition boundaries is central to it. In the present article, a framework for shot transition detection and keyframe extraction is proposed. The proposed method is efficient and simple, and it does not require supervision, which makes it attractive. The method establishes shot transition boundaries by estimating feature similarity (FSIM) between the gradient magnitudes of consecutive frames. Then the frame with the highest mean and standard deviation is chosen as the keyframe for that shot. If one feature fails to establish a shot transition boundary, another feature may succeed in establishing it at the proper frame location. The proposed algorithm is tested on four different datasets: one developed by us, two well-known standard datasets for evaluating keyframe extraction algorithms, and one standard surveillance video dataset. All the datasets are publicly available. Performance is evaluated in terms of figure of merit, detection percentage, accuracy, and missing factor. The experimental results show that the proposed method outperforms other state-of-the-art methods.
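The boundary-detection step can be sketched with a simplified similarity measure standing in for FSIM (which in the paper also uses phase congruency); both the measure and the threshold below are illustrative assumptions:

```python
import numpy as np

def gradient_magnitude(frame):
    """Finite-difference gradient magnitude of a grayscale frame."""
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy)

def similarity(a, b, eps=1e-8):
    """Simplified normalized similarity between two gradient-magnitude
    maps (a stand-in for FSIM); equals 1.0 for identical maps."""
    return (2 * (a * b).sum() + eps) / ((a * a).sum() + (b * b).sum() + eps)

def shot_boundaries(frames, thresh=0.5):
    """Mark a boundary wherever consecutive-frame similarity drops below thresh."""
    grads = [gradient_magnitude(f) for f in frames]
    return [i + 1 for i in range(len(grads) - 1)
            if similarity(grads[i], grads[i + 1]) < thresh]

# Toy "video": five gentle-ramp frames, then five steeper-ramp frames.
ramp = np.tile(np.arange(16.0), (16, 1))
frames = [ramp] * 5 + [5.0 * ramp] * 5
print(shot_boundaries(frames))
```

Within each toy shot the similarity is exactly 1, so the single detected cut falls between frames 4 and 5.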
... Clustering algorithms are widely adopted in earlier methods, including Delaunay clustering [15] and k-means [16]. The frame sequence is partitioned into several clusters. ...
Preprint
Although video summarization has achieved tremendous success benefiting from Recurrent Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits the performance. Transformer is an effective model to deal with this problem, and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation, video captioning, etc. Motivated by the great success of transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization, which can capture the dependencies among frames and shots, and summarize the video by exploiting the scene information formed by shots. Furthermore, we argue that both the audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
... Padmavathi Mundur [17] proposed an automatic video summarization technique based on Delaunay Triangulation. They presented meaningful comparisons of the results of the proposed DT algorithm against the OV storyboard by defining metrics such as the significance factor, the overlap factor, and the compression factor, all of which evaluate the representational power of the proposed DT summarization technique. ...
... Li et al. in [27] patented a summarization technique using a plurality of video segments. Padmavathi Mundur et al. in [28] improved upon traditional clustering techniques that depend on the input data by generating multi-dimensional point data from the frame content and using Delaunay Triangulation for clustering. Parikh et al. in [29] presented an optimal camera placement algorithm that maximizes the covered surveillance area. ...
... Early video summarization approaches are unsupervised and made use of low level similarity measures between the frames [14], [15], [16], [17], [18]; while recent unsupervised studies apply Generative Adversarial Networks (GANs) [19] and attention [20], [21] to solve the video summarization problem. They use GANs to reconstruct the input video from the selected key frames for unsupervised video summarization. ...
Preprint
Increasing volume of user-generated human-centric video content and their applications, such as video retrieval and browsing, require compact representations that are addressed by the video summarization literature. Current supervised studies formulate video summarization as a sequence-to-sequence learning problem and the existing solutions often neglect the surge of human-centric view, which inherently contains affective content. In this study, we investigate the affective-information enriched supervised video summarization task for human-centric videos. First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes. Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM). In addition, we investigate the use of attention to improve the AVSUM architectures and propose two new architectures based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We conduct video summarization experiments on the TvSum database. The proposed AVSUM-GRU architecture with an early fusion of high level GRU embeddings and the temporal attention based TA-AVSUM architecture attain competitive video summarization performances by bringing strong performance improvements for the human-centric videos compared to the state-of-the-art in terms of F-score and self-defined face recall metrics.
Preprint
This paper proposes an efficient video summarization framework that will give a gist of the entire video in a few key-frames or video skims. Existing video summarization frameworks are based on algorithms that utilize computer vision low-level feature extraction or high-level domain level extraction. However, being the ultimate user of the summarized video, humans remain the most neglected aspect. Therefore, the proposed paper considers human's role in summarization and introduces human visual attention-based summarization techniques. To understand human attention behavior, we have designed and performed experiments with human participants using electroencephalogram (EEG) and eye-tracking technology. The EEG and eye-tracking data obtained from the experimentation are processed simultaneously and used to segment frames containing useful information from a considerable video volume. Thus, the frame segmentation primarily relies on the cognitive judgments of human beings. Using our approach, a video is summarized by 96.5% while maintaining higher precision and high recall factors. The comparison with the state-of-the-art techniques demonstrates that the proposed approach yields ceiling-level performance with reduced computational cost in summarising the videos.
Article
Determining an accurate flow direction is a prerequisite for hydrographic analysis and river generalization. For river networks with complex spatial structures and little semantic information, it has always been a difficult and interesting problem to automatically and objectively determine the flow direction of all rivers. In a river network with many estuaries, tributaries may either be “simple tributaries” that are associated only with a single main channel or “bridging tributaries” that link multiple main channels. The former is large in number, while the latter is few, but both of them are very important in constructing the hierarchical relationship of the whole river network. Existing studies have concentrated on simple tributaries, and flow direction reasoning for bridging tributaries is ignored, leading to a misunderstanding of the spatial structure of river networks. To address this problem, an automatic method of flow direction reasoning for bridging tributaries using adjacency relation (FDR‐BR method) is proposed. First, in view of the insufficient semantic information in actual river data, the principle of “the minority is subordinate to the majority” is adopted to establish two statistical identification criteria for estuaries, and main channels are extracted by using the good continuity feature between river reaches, that is, the stroke feature. Second, the Kth‐order adjacency fields are constructed for each main channel based on the topological connection relationships between the main channels and tributaries. Finally, the “split river reaches” are detected from the bridging tributaries, and the spatial adjacency relation is used as a constraint to identify the benchmark main channel of each bridging tributary. The FDR‐BR method is validated using a geographical census dataset for a city in China. In the experimental area, simple river networks with independent main channels account for 91.67% of the networks, and complex river networks with multiple main channels account for 8.33%. For the complex river networks, 77.79% of the tributaries are simple tributaries, while 13.27% are bridging tributaries. The experimental results reveal that for all rivers in the simple river networks and the simple tributaries in the complex river networks, the flow direction reasoning results of the FDR‐BR method are consistent with the results of the state‐of‐the‐art Schwenk method, with an accuracy of more than 98%; for the bridging tributaries in the complex river networks, the flow direction reasoning accuracy of the Schwenk method is only 59.3%, while the accuracy of the FDR‐BR method reaches 98%, and the results are more realistic.
Article
In this paper, we propose a Detect-to-Summarize network (DSNet) framework for supervised video summarization. Our DSNet contains anchor-based and anchor-free counterparts. The anchor-based method generates temporal interest proposals to determine and localize the representative contents of video sequences, while the anchor-free method eliminates the pre-defined temporal proposals and directly predicts the importance scores and segment locations. Different from existing supervised video summarization methods which formulate video summarization as a regression problem without temporal consistency and integrity constraints, our interest detection framework is the first attempt to leverage temporal consistency via the temporal interest detection formulation. Specifically, in the anchor-based approach, we first provide a dense sampling of temporal interest proposals with multi-scale intervals that accommodate interest variations in length, and then extract their long-range temporal features for interest proposal location regression and importance prediction. Notably, positive and negative segments are both assigned for the correctness and completeness information of the generated summaries. In the anchor-free approach, we alleviate drawbacks of temporal proposals by directly predicting importance scores of video frames and segment locations. Particularly, the interest detection framework can be flexibly plugged into off-the-shelf supervised video summarization methods. We evaluate the anchor-based and anchor-free approaches on the SumMe and TVSum datasets. Experimental results clearly validate the effectiveness of the anchor-based and anchor-free approaches.
Article
Full-text available
Key frames provide a suitable video summary and a framework for video indexing, browsing, and retrieval, and how to extract key frames is a core problem in video retrieval. To address redundancy and missed selections in key frame extraction, this paper proposes a key frame extraction algorithm based on optical flow and mutual information entropy. The algorithm integrates mutual information entropy and optical flow characteristics to extract key frames. First, the optical flow method computes the total optical flow of each frame; video frames with extreme optical-flow differences in their neighbourhood are selected as key frames, and the remaining frames are placed in a candidate key frame set. The minimum mutual information entropy over the key frame set is taken as a threshold. Then, the mutual information entropy of each frame in the candidate set is computed, and frames exceeding the threshold are added to the key frame set. Finally, redundant key frames are deleted by measuring inter-frame similarity, and the remaining frames are the extracted key frames. The experimental results show that, compared with other methods, the key frames extracted by this approach achieve significantly better precision, recall, and F-score, and reflect the video content more effectively.
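The mutual-information component of such a scheme can be sketched from a joint intensity histogram; the bin count and frame sizes below are arbitrary illustrative choices, not the paper's settings:

```python
import numpy as np

def mutual_information(frame_a, frame_b, bins=16):
    """Mutual information (in bits) between two grayscale frames,
    estimated from the joint histogram of their pixel intensities."""
    joint, _, _ = np.histogram2d(frame_a.ravel(), frame_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()               # joint distribution
    px = pxy.sum(axis=1, keepdims=True)     # marginal of frame_a
    py = pxy.sum(axis=0, keepdims=True)     # marginal of frame_b
    nz = pxy > 0                            # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
a = rng.integers(0, 256, size=(64, 64))
identical = mutual_information(a, a)  # high: the frames share all content
unrelated = mutual_information(a, rng.integers(0, 256, size=(64, 64)))  # near zero
print(identical, unrelated)
```

A low mutual information between consecutive frames signals a large content change, which is what makes it usable as a key-frame criterion.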
Article
Current approaches mainly devote to modeling the video as a frame sequence by recurrent neural networks. However, one potential limitation of the sequence models is that they focus on capturing local neighborhood dependencies while the high-order dependencies in long distance are not fully exploited. In general, the frames in each shot record a certain activity and vary smoothly over time, but the multi-hop relationships occur frequently among shots. In this case, both the local and global dependencies are important for understanding the video content. Motivated by this point, we propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically, where the frame-level dependencies are encoded by Long Short-Term Memory (LSTM), and the shot-level dependencies are captured by the Graph Convolutional Network (GCN). Then, the videos are summarized by exploiting both the local and global dependencies among shots. Besides, a reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner, which can avert the lack of annotated data in video summarization. Practically, experiments on three popular datasets have demonstrated the superiority of our proposed approach.
Article
Complex junctions are typical microstructures in large‐scale road networks with intricate structures and varied morphologies. It is a challenge to identify junctions in map generalization and car navigation tasks accurately. Generally, traditional recognition methods rely on low‐level characteristics of manual design, such as parallelism and symmetry. In recent years, preliminary studies using deep learning‐based recognition methods were conducted. However, only a few junction types can be recognized by existing methods, and these methods cannot effectively identify junctions with irregular shapes and numerous interference sections. Hence, this article proposes a complex junction recognition method based on the GoogLeNet model. First, the Delaunay triangulation clustering algorithm was used to automatically identify the center point and spatial range of training samples for complex junctions. Second, vector training samples were selected from OpenStreetMap (OSM) data of 39 cities across China, and the samples were then augmented through simplification, rotation, and mirroring. Finally, the vector sample data were transformed into raster images, and the GoogLeNet model was trained to learn the high‐level fuzzy characteristics. Experiments based on OSM data from Tianjin city, China, revealed that compared with state‐of‐the‐art methods, the proposed method effectively identified more types of complex junctions and achieved a significantly higher identification accuracy. Furthermore, the proposed method has strong generalizability and anti‐interference capability.
Article
Video summarization is an effective way to facilitate video searching and browsing. Most of existing systems employ encoder-decoder based recurrent neural networks, which fail to explicitly diversify the system-generated summary frames while requiring intensive computations. In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention called SUM-GDA, which adapts attention mechanism in a global perspective to consider pairwise temporal relations of video frames. Particularly, the GDA module has two advantages: (1) it models the relations within paired frames as well as the relations among all pairs, thus capturing the global attention across all frames of one video; (2) it reflects the importance of each frame to the whole video, leading to diverse attention on these frames. Thus, SUM-GDA is beneficial for generating diverse frames to form satisfactory video summary. Extensive experiments on three data sets, i.e., SumMe, TVSum, and VTW, have demonstrated that SUM-GDA and its extension outperform other competing state-of-the-art methods with remarkable improvements. In addition, the proposed models can be run in parallel with significantly less computational costs, which helps the deployment in highly demanding applications.
Chapter
Video summarization is a process of reducing the video content in order to generate a summary in the video format. The summary created should contain the most important parts of the original video. With video content increasing at a rapid rate day by day, automatic video summarization will be beneficial for anyone who wants to save time and learn more in less time. This paper gives an insight into summarizing lecture videos using subtitles. Evaluations on NPTEL (National Programme on Technology Enhanced Learning) videos against human-made summaries show that the proposed approach is effective. They show that punctuation in the subtitles plays a major role in summarizing lecture videos. Using punctuation along with the subtitle text yields an average ROUGE precision of 0.822, an average recall of 0.802, and an average F-measure of 0.805.
Article
Keyframe extraction is an effective way to achieve video summarization. More recent studies using deep learning networks are heavily dependent on massive historical datasets for training. For practicality in real applications, we focus more on unsupervised online analysis and present a novel graph-based structural difference analysis method for this purpose. Unlike traditional methods of video representation based on raw features, undirected weighted graphs are constructed from the resulting features to represent video frames. The detailed structural changes between graphs are more consistent with the actual changes between video frames than raw features, thus making the newly proposed method robust for detecting various types of shot transitions, such as hard cuts, dissolves, wipes, and fade-ins/fade-outs. Then, considering the local influence between successive frames, a structural difference analysis of graphs is performed to detect the video shot boundaries. Finally, the median graph of each shot is obtained to extract the corresponding keyframe. Extensive experiments are conducted on three video summarization benchmark datasets. Quantitative and qualitative comparisons are made between the proposed method and other state-of-the-art methods, with the proposed method yielding remarkable improvements from 1.9% to 3.1% in terms of the F-score on the three datasets.
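The graph-based idea can be illustrated with a toy construction; the choices below (grid blocks as nodes, absolute intensity-difference edge weights) are made up for illustration and are not the paper's actual graph representation:

```python
import numpy as np

def frame_graph(frame, grid=4):
    """Represent a frame as a weighted graph: nodes are grid blocks and
    edge weights are absolute differences of mean block intensities.
    Returns the weighted adjacency matrix (grid*grid x grid*grid)."""
    h, w = frame.shape
    means = frame.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3)).ravel()
    return np.abs(means[:, None] - means[None, :])

def structural_difference(g1, g2):
    """Total absolute edge-weight change between two frame graphs."""
    return float(np.abs(g1 - g2).sum())

flat = np.zeros((16, 16))
ramp = np.tile(np.linspace(0.0, 255.0, 16), (16, 1))
print(structural_difference(frame_graph(flat), frame_graph(flat)))  # identical frames: 0.0
print(structural_difference(frame_graph(flat), frame_graph(ramp)))  # differing frames: > 0
```

Peaks in this difference signal along the frame sequence would correspond to candidate shot boundaries, after which a per-shot representative (the median graph in the paper) is chosen.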
Article
Video Summarization (VS) has become one of the most effective solutions for quickly understanding a large volume of video data. Dictionary selection with self representation and sparse regularization has demonstrated its promise for VS by formulating the VS problem as a sparse selection task on video frames. However, existing dictionary selection models are generally designed only for data reconstruction, which results in the neglect of the inherent structured information among video frames. In addition, the sparsity commonly constrained by L2,1 norm is not strong enough, which causes the redundancy of keyframes, i.e., similar keyframes are selected. Therefore, to address these two issues, in this paper we propose a general framework called graph convolutional dictionary selection with L2,p (0 < p ≤ 1) norm (GCDS2,p) for both keyframe selection and skimming based summarization. Firstly, we incorporate graph embedding into dictionary selection to generate the graph embedding dictionary, which can take the structured information depicted in videos into account. Secondly, we propose to use L2,p (0 < p ≤ 1) norm constrained row sparsity, in which p can be flexibly set for two forms of video summarization. For keyframe selection, 0 < p < 1 can be utilized to select diverse and representative keyframes; and for skimming, p = 1 can be utilized to select key shots. In addition, an efficient iterative algorithm is devised to optimize the proposed model, and the convergence is theoretically proved. Experimental results including both keyframe selection and skimming based summarization on four benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
Article
Although video summarization has achieved tremendous success benefiting from Recurrent Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits the performance. Transformer is an effective model to deal with this problem, and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation, video captioning, etc. Motivated by the great success of transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization, which can capture the dependencies among frames and shots hierarchically. Furthermore, we argue that both the audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Practically, extensive experiments are conducted on two benchmarks, SumMe and TVsum, where the effectiveness of the hierarchical structure and multimodal fusion mechanism is verified, and the superiority of HMT compared to RNN-based methods is demonstrated.
Article
Although video summarization has achieved tremendous success benefiting from Recurrent Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits the performance. Transformer is an effective model to deal with this problem, and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation, video captioning, etc. Motivated by the great success of transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization, which can capture the dependencies among frame and shots, and summarize the video by exploiting the scene information formed by shots. Furthermore, we argue that both the audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Practically, extensive experiments show that HMT achieves (F-measure: 0.441, Kendall’s τ: 0.079, Spearman’s ρ: 0.080) and (F-measure: 0.601, Kendall’s τ: 0.096, Spearman’s ρ: 0.107) on SumMe and TVsum, respectively. It surpasses most of the traditional, RNN-based and attention-based video summarization methods.
Article
Video summary technology based on keyframe extraction is an effective means to rapidly access video content. Traditional video summary generation technology requires high video resolution, which poses a problem as most existing studies have no targeted solutions for videos that are subject to privacy protection. We propose a novel keyframe extraction algorithm for video data in the visual shielding domain, named visual shielding compressed sensing coding and double-layer affinity propagation (VSCS-DAP). VSCS-DAP involves three main steps. First, the video is compressed by compressed sensing technology to provide a visual shielding effect (protecting the privacy of monitored figures) while significantly reducing the data volume. Then, pyramid histogram of oriented gradients (PHOG) features are extracted from the compressed video and clustered by a first-step affinity propagation (AP) to obtain the first-stage summaries. Finally, fused PHOG and Hist (PHOG-Hist) features are extracted from the first-stage keyframes and clustered by a second-step AP algorithm to obtain the final output summaries. Experimental results obtained on two common video datasets show that our method exhibits advantages including low redundancy and few missing frames, low computational complexity, strong real-time performance, and robustness to vision-shielded video.
Preprint
Exploiting the inner-shot and inter-shot dependencies is essential for key-shot based video summarization. Current approaches mainly devote to modeling the video as a frame sequence by recurrent neural networks. However, one potential limitation of the sequence models is that they focus on capturing local neighborhood dependencies while the high-order dependencies in long distance are not fully exploited. In general, the frames in each shot record a certain activity and vary smoothly over time, but the multi-hop relationships occur frequently among shots. In this case, both the local and global dependencies are important for understanding the video content. Motivated by this point, we propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically, where the frame-level dependencies are encoded by Long Short-Term Memory (LSTM), and the shot-level dependencies are captured by the Graph Convolutional Network (GCN). Then, the videos are summarized by exploiting both the local and global dependencies among shots. Besides, a reconstructor is developed to reward the summary generator, so that the generator can be optimized in an unsupervised manner, which can avert the lack of annotated data in video summarization. Furthermore, under the guidance of reconstruction loss, the predicted summary can better preserve the main video content and shot-level dependencies. Practically, the experimental results on three popular datasets (i.e., SumMe, TVsum and VTW) have demonstrated the superiority of our proposed approach to the summarization task.
Chapter
Video summarization is the process of automatically extracting relevant frames or segments from a video that can best represent the contents of the video. In the proposed framework, a modified block-based clustering technique is implemented for video summarization. The clustering technique employed is feature agglomeration clustering which results in dimensionality reduction and makes the system an optimized one. The sampled frames from the video are divided into varying number of blocks and clustering is employed on corresponding block sets of all frames rather than clustering frames as a whole. Additionally, image compression based on Discrete Cosine Transform is applied on the individual frames. Results prove that the proposed framework can produce optimum results by varying the block sizes in a computationally efficient manner for videos of different duration. Moreover, the division of frames into blocks before applying clustering ensures that maximum information is retained in the summary.
Chapter
In recent years, the amount of video data has been increasing explosively, and the requirements for video summarization technology have also increased. Video summarization is a summary of the video. By browsing the video summarization, users can quickly understand the content of the video. The traditional video summarization algorithms extract the global features of the video frames to form video summarization. However, these algorithms have obvious disadvantages. Therefore, we propose a method to generate video summarization by fusing the global and local features of video frames, and clustering video frames by DBSCAN algorithm. By comparing with the video summarization manually selected by multiple users, we achieve better results on OVP and YouTube datasets than previous algorithms.
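A compact sketch of DBSCAN-based frame grouping on synthetic fused features; the eps and min_samples values are illustrative, not tuned, and the features stand in for the fused global/local descriptors:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def summarize_frames(features, eps=0.5, min_samples=2):
    """Cluster frame features with DBSCAN and pick, per cluster, the frame
    closest to the cluster mean; noise frames (label -1) are skipped."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    summary = []
    for c in set(labels) - {-1}:
        members = np.where(labels == c)[0]
        center = features[members].mean(axis=0)
        dists = np.linalg.norm(features[members] - center, axis=1)
        summary.append(int(members[np.argmin(dists)]))
    return sorted(summary)

# Synthetic "video": two groups of 8 frames with well-separated features.
rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(loc=m, scale=0.05, size=(8, 4)) for m in (0.0, 2.0)])
print(summarize_frames(feats))
```

Unlike k-means, DBSCAN does not require the number of clusters in advance, which is the same practical motivation the paper under discussion gives for Delaunay clustering.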
Article
Automatic analysis of construction video footage is beneficial for project management tasks such as productivity analysis and safety control. However, construction videos are usually long in duration and only contain limited useful information to engineers, while the storage of video data from construction projects is challenging. To obtain and store useful video footage systematically and concisely, this research proposes a vision-based method to automatically generate video highlights from construction videos. The proposed approach is validated through two case studies: a gate scenario and an earthmoving scenario. In experiments, the proposed method has achieved 89.2% on precision and 93.3% on recall, which outperforms the feature-based method by 12.7% and 17.2% on precision and recall, respectively. Meanwhile, the proposed method reduces the required digital storage space by 94.6%. The proposed approach offers potential benefits to construction management in terms of significantly reducing video storage space and efficiently indexing construction video footage.
Chapter
Adequate coverage is one of the main problems in sensor networks. The effectiveness of a distributed wireless sensor network depends strongly on its sensor deployment scheme. Optimizing sensor deployment provides sufficient coverage while saving the cost of placing sensors at grid points. This article applies a modified binary particle swarm optimization (PSO) algorithm to sensor placement in distributed sensor networks. PSO is inherently a continuous algorithm, so a discrete PSO adapted to the discrete binary space is proposed. In distributed sensor networks, sensor placement is an NP-complete problem for arbitrary sensor fields. The proposed algorithm addresses this problem by considering two factors: complete coverage and minimum cost. The proposed method is examined in different areas. The results not only confirm the success of the new method for sensor placement but also show that it is more efficient than other methods such as Simulated Annealing (SA), PBIL, and LAEDA.
Conference Paper
Full-text available
Key frame extraction has been recognized as one of the important research issues in video information retrieval. Although progress has been made in key frame extraction, the existing approaches are either computationally expensive or ineffective in capturing salient visual content. We first discuss the importance of key frame selection, and then review and evaluate the existing approaches. To overcome the shortcomings of the existing approaches, we introduce a new algorithm for key frame extraction based on unsupervised clustering. The proposed algorithm is both computationally simple and able to adapt to the visual content. Its efficiency and effectiveness are validated on a large collection of real-world videos.
Article
Full-text available
This paper presents an overview of color and texture descriptors that have been approved for the Final Committee Draft of the MPEG-7 standard. The color and texture descriptors that are described in this paper have undergone extensive evaluation and development during the past two years. Evaluation criteria include effectiveness of the descriptors in similarity retrieval, as well as extraction, storage, and representation complexities. The color descriptors in the standard include a histogram descriptor that is coded using the Haar transform, a color structure histogram, a dominant color descriptor, and a color layout descriptor. The three texture descriptors include one that characterizes homogeneous texture regions and another that represents the local edge distribution. A compact descriptor that facilitates texture browsing is also defined. Each of the descriptors is explained in detail by their semantics, extraction and usage. The effectiveness is documented by experimental results.
Article
Full-text available
Key frames and previews are two forms of a video abstract, widely used for various applications in video browsing and retrieval systems. We propose in this paper a novel method for generating these two abstract forms for an arbitrary video sequence. The underlying principle of the proposed method is the removal of the visual-content redundancy among video frames. This is done by first applying multiple partitional clustering to all frames of a video sequence and then selecting the most suitable clustering option(s) using an unsupervised procedure for cluster-validity analysis. In the last step, key frames are selected as centroids of obtained optimal clusters. Video shots, to which key frames belong, are concatenated to form the preview sequence.
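The cluster-validity step can be illustrated as follows; the validity index here (mean within-cluster scatter over the smallest centroid gap) and the toy data are assumptions for the sketch, not the paper's exact procedure:

```python
import random

random.seed(2)

def kmeans(pts, k, iters=20):
    """Plain k-means on 2-D frame features."""
    ctr = random.sample(pts, k)
    for _ in range(iters):
        cl = [[] for _ in range(k)]
        for p in pts:
            cl[min(range(k),
                   key=lambda j: (p[0] - ctr[j][0]) ** 2 + (p[1] - ctr[j][1]) ** 2)].append(p)
        ctr = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else ctr[j]
               for j, c in enumerate(cl)]
    return ctr, cl

def validity(ctr, cl):
    """Mean within-cluster scatter over the smallest centroid gap (lower is better)."""
    within = sum(((p[0] - ctr[j][0]) ** 2 + (p[1] - ctr[j][1]) ** 2) ** 0.5
                 for j, c in enumerate(cl) for p in c) / sum(len(c) for c in cl)
    between = min(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
                  for i, a in enumerate(ctr) for b in ctr[i + 1:])
    return within / between

# frames drawn from three visually distinct "shots"
pts = [(random.gauss(mx, 0.2), random.gauss(my, 0.2))
       for mx, my in [(0, 0), (4, 0), (2, 4)] for _ in range(20)]
# run several clustering options and keep the one the validity index prefers
scores = {k: validity(*kmeans(pts, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)
```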
Article
Full-text available
A video sequence can be represented as a trajectory curve in a high dimensional feature space. This video curve can be analyzed by tools similar to those developed for planar curves. In particular, the classic binary curve splitting algorithm has been found to be a useful tool for video analysis. With a splitting condition that checks the dimensionality of the curve segment being split, the video curve can be recursively simplified and represented as a tree structure, and the frames that are found to be junctions between curve segments at different levels of the tree can be used as keyframes to summarize the video sequences at different levels of detail. These keyframes can be combined in various spatial and temporal configurations for browsing purposes. We describe a simple video player that displays the keyframes sequentially and lets the user change the summarization level on the fly with a slider. We also describe an approach to automatically selecting a summarization level that pr...
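A minimal sketch of the recursive splitting, assuming Euclidean feature vectors and a fixed flatness tolerance in place of the paper's dimensionality-based splitting condition:

```python
def split_keyframes(curve, tol=0.5):
    """Recursively split the trajectory at the frame farthest from the chord
    joining the segment's endpoints; the junctions become keyframes."""
    def dist_to_chord(p, a, b):
        # distance in feature space from p to the line through a and b
        ab = [b[i] - a[i] for i in range(len(a))]
        ap = [p[i] - a[i] for i in range(len(a))]
        ab2 = sum(x * x for x in ab) or 1e-12
        t = sum(x * y for x, y in zip(ap, ab)) / ab2
        return sum((ap[i] - t * ab[i]) ** 2 for i in range(len(a))) ** 0.5

    def rec(lo, hi, out):
        if hi - lo < 2:
            return
        j = max(range(lo + 1, hi),
                key=lambda i: dist_to_chord(curve[i], curve[lo], curve[hi]))
        if dist_to_chord(curve[j], curve[lo], curve[hi]) > tol:
            out.add(j)          # junction between two curve segments
            rec(lo, j, out)
            rec(j, hi, out)

    keys = {0, len(curve) - 1}
    rec(0, len(curve) - 1, keys)
    return sorted(keys)

# piecewise-linear toy trajectory with one sharp corner at index 5
curve = [(float(i), 0.0) for i in range(6)] + [(5.0, float(i)) for i in range(1, 6)]
keys = split_keyframes(curve)
```

Applying the same routine with a larger tolerance yields fewer junctions, which is the mechanism behind the adjustable summarization level described above.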
Article
An automatic authoring system for the generation of pictorial transcripts of video programs which are accompanied by closed caption information is presented. A number of key frames, each of which represents the visual information in a segment of the video (i.e., a scene), are selected automatically by performing a content-based sampling of the video program. The textual information is recovered from the closed caption signal and is initially segmented based on its implied temporal relationship with the video segments. The text segmentation boundaries are then adjusted, based on lexical analysis and/or caption control information, to account for synchronization errors due to possible delays in the detection of scene boundaries or the transmission of the caption information. The closed caption text is further refined through linguistic processing for conversion to lowercase with correct capitalization. The key frames and the related text generate a compact multimedia presentation of the contents of the video program which lends itself to efficient storage and transmission. This compact representation can be viewed on a computer screen, or used to generate the input to a commercial text processing package to generate a printed version of the program.
Article
We introduce a geometric transformation that allows Voronoi diagrams to be computed using a sweepline technique. The transformation is used to obtain simple algorithms for computing the Voronoi diagram of point sites, of line segment sites, and of weighted point sites. All algorithms have O(n log n) worst-case running time and use O(n) space.
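For orientation, a Voronoi diagram of point sites can be computed with off-the-shelf tools; note that SciPy wraps Qhull rather than the sweepline technique of the paper, but the diagram itself is the same:

```python
import numpy as np
from scipy.spatial import Voronoi

# five point sites in the plane (toy data)
sites = np.array([[0, 0], [2, 0], [1, 2], [3, 2], [0, 3]], dtype=float)
vor = Voronoi(sites)

# the diagram assigns one (possibly unbounded) region to each site
regions = [vor.regions[r] for r in vor.point_region]
```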
Conference Paper
This paper reports two studies that measured the effects of different "video skim" techniques on comprehension, navigation, and user satisfaction. Video skims are compact, content-rich abstractions of longer videos, condensations that preserve frame rate while greatly reducing viewing time. Their characteristics depend on the image- and audio-processing techniques used to create them. Results from the initial study helped refine video skims, which were then reassessed in the second experiment. Significant benefits were found for skims built from audio sequences meeting certain criteria.
Conference Paper
Data clustering is an important technique for visual data management. Most previous work focuses on clustering video data within single sources. We address the problem of clustering across sources, and propose novel spectral clustering algorithms for multisource clustering problems. Spectral clustering is a new discriminative method realizing clustering by partitioning data graphs. We represent multi-source data as bipartite or K-partite graphs, and investigate the spectral clustering algorithm under these representations. The algorithms are evaluated using the TRECVID-2003 corpus with semantic features extracted from speech transcripts and visual concept recognition results from videos. The experiments show that the proposed bipartite clustering algorithm significantly outperforms the regular spectral clustering algorithm in capturing cross-source associations.
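The bipartite-graph idea can be sketched with Dhillon-style spectral co-clustering of a hypothetical segment-by-term affinity matrix (toy numbers, not the TRECVID features used in the paper):

```python
import numpy as np

# affinity between 6 video segments (rows) and 4 transcript terms (columns);
# two blocks plus weak cross-source noise
A = np.array([[5.0, 5.0, 0.2, 0.2],
              [4.0, 6.0, 0.2, 0.2],
              [5.0, 4.0, 0.2, 0.2],
              [0.2, 0.2, 6.0, 5.0],
              [0.2, 0.2, 5.0, 5.0],
              [0.2, 0.2, 4.0, 6.0]])

# normalize A_n = D1^{-1/2} A D2^{-1/2}, then take its SVD: the singular
# vectors jointly embed both sides of the bipartite graph
D1 = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
D2 = np.diag(1.0 / np.sqrt(A.sum(axis=0)))
U, s, Vt = np.linalg.svd(D1 @ A @ D2)

# the sign of the 2nd singular vector splits rows and columns into the
# two cross-source clusters simultaneously
row_labels = (U[:, 1] > 0).astype(int)
col_labels = (Vt[1] > 0).astype(int)
```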
Article
In this paper, we propose novel video summarization and retrieval systems based on unique properties from singular value decomposition (SVD). Through mathematical analysis, we derive the SVD properties that capture both the temporal and spatial characteristics of the input video in the singular vector space. Using these SVD properties, we are able to summarize a video by outputting a motion video summary with the user-specified length. The motion video summary aims to eliminate visual redundancies while assigning equal show time to equal amounts of visual content for the original video program. On the other hand, the same SVD properties can also be used to categorize and retrieve video shots based on their temporal and spatial characteristics. As an extended application of the derived SVD properties, we propose a system that is able to retrieve video shots according to their degrees of visual changes, color distribution uniformities, and visual similarities.
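A toy illustration of the singular-vector-space view, with synthetic frame features standing in for real descriptors; the redundancy measure below is a simplification of the paper's derived SVD properties:

```python
import numpy as np

rng = np.random.default_rng(3)

# columns = per-frame feature vectors: 30 near-identical frames from a
# static shot followed by 10 frames with large visual changes
static = 1.0 + rng.normal(0, 0.05, (16, 30))
dynamic = rng.normal(0, 1.0, (16, 10))
A = np.hstack([static, dynamic])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
# frame i maps to the point s[:k] * Vt[:k, i] in the singular vector space
emb = (s[:k, None] * Vt[:k]).T            # frames x k

# redundant (static) frames collapse to nearly a single point, while the
# visually changing frames spread out -- the redundancy a summary can drop
spread_static = emb[:30].std(axis=0).sum()
spread_dynamic = emb[30:].std(axis=0).sum()
```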
Article
In this article we describe the primary goals of the Open Video Digital Library (OVDL), its evolution and current status. We provide overviews of the OVDL user interface research and user studies we have conducted with it, and we outline our plans for future Open Video related activities.
Article
An easily implemented modification to the divide-and-conquer algorithm for computing the Delaunay triangulation of n sites in the plane is presented. The change reduces its Θ(n log n) expected running time to O(n log log n) for a large class of distributions that includes the uniform distribution in the unit square. Experimental evidence presented demonstrates that the modified algorithm performs very well for n ≤ 2^16, the range of the experiments. It is conjectured that the average number of edges it creates—a good measure of its efficiency—is no more than twice optimal for n less than seven trillion. The improvement is shown to extend to the computation of the Delaunay triangulation in the L_p metric for 1 ≤ p ≤ ∞.
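In practice a Delaunay triangulation is usually obtained from a library; SciPy, for example, wraps Qhull rather than the divide-and-conquer variant analyzed here, but yields the same triangulation:

```python
import numpy as np
from scipy.spatial import Delaunay

# unit square corners plus the center: the center breaks the cocircular
# tie, so the triangulation (four triangles) is unique
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
tri = Delaunay(pts)

# point location: which triangle contains a query point?
hit = tri.find_simplex(np.array([0.5, 0.25]))
```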
Conference Paper
In this paper, we propose a novel video summarization technique using which we can automatically generate high quality video summaries suitable for wireless and mobile environments. The significant contribution of this paper lies in the proposed clustering scheme. We use Delaunay diagrams to cluster multidimensional point data corresponding to the frame contents of the video. In contrast to the existing clustering techniques used for summarization, our clustering algorithm is fully automatic and well suited for batch processing. We illustrate the quality of our clustering and summarization scheme in an experiment using several video clips.
Conference Paper
Video compression and retrieval have been treated as separate problems in the past. We present an object-based video representation that facilitates both compression and retrieval. Typically in retrieval applications, a video sequence is subdivided in time into a set of shorter segments, each of which contains similar content. These segments are represented by 2-D representative images called "key-frames" that greatly reduce the amount of data that is searched. However, key-frames do not describe the motions and actions of objects within the segment. We propose a representation that extends the idea of the key-frame to further include what we define as "key-objects": regions within a key-frame that move with similar motion. Key-objects thus allow a retrieval system to present information to users more efficiently and assist them in browsing and retrieving relevant video content.
Article
We use principal component analysis (PCA) to reduce the dimensionality of features of video frames for the purpose of content description. This low-dimensional description makes practical the direct use of all the frames of a video sequence in later analysis. The PCA representation circumvents or eliminates several of the stumbling blocks in current analysis methods and makes new analyses feasible. We demonstrate this with two applications. The first accomplishes high-level scene description without shot detection and key-frame selection. The second uses the time sequences of motion data from every frame to classify sports sequences.
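The PCA reduction can be sketched directly via the SVD of the centered frame-feature matrix; the features below are synthetic stand-ins for real frame descriptors:

```python
import numpy as np

rng = np.random.default_rng(4)

# rows = per-frame feature vectors (e.g. flattened color histograms);
# scale a few directions up so they carry most of the variation
frames = rng.normal(size=(200, 64))
frames[:, :3] *= 10.0

X = frames - frames.mean(axis=0)       # center the features
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
low_dim = X @ Vt[:k].T                 # 64-D frames -> 3-D description

# fraction of total variance kept by the 3-D description
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
```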
Article
Extracting a small number of key frames that can abstract the content of video is very important for efficient browsing and retrieval in video databases. In this paper, the key frame extraction problem is considered from a set-theoretic point of view, and systematic algorithms are derived to find a compact set of key frames that can represent a video segment for a given degree of fidelity. The proposed extraction algorithms can be hierarchically applied to obtain a tree-structured key frame hierarchy that is a multilevel abstract of the video. The key frame hierarchy enables an efficient content-based retrieval by using the depth-first search scheme with pruning. Intensive experiments on a variety of video sequences are presented to demonstrate the improved performance of the proposed algorithms over the existing approaches.
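The set-theoretic view suggests a covering formulation; a toy greedy sketch, assuming Euclidean frame features and a fidelity radius eps (the paper's actual algorithms and fidelity measure differ):

```python
def fidelity_keyframes(frames, eps):
    """Greedy cover: repeatedly pick the frame that represents (lies within
    eps of) the most still-uncovered frames, until every frame is covered."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    uncovered = set(range(len(frames)))
    keys = []
    while uncovered:
        best = max(range(len(frames)),
                   key=lambda i: sum(d(frames[i], frames[j]) <= eps for j in uncovered))
        keys.append(best)
        uncovered -= {j for j in uncovered if d(frames[best], frames[j]) <= eps}
    return sorted(keys)

# two tight shots: one key frame per shot suffices at this fidelity level
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
keys = fidelity_keyframes(frames, eps=0.5)
```

Re-running the routine with a coarser eps on the chosen key frames would give the next level up in a tree-structured hierarchy of the kind described above.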
Article
Widespread clustering methods require user-specified arguments and prior knowledge to produce their best results. This demands pre-processing and/or several trial and error steps. Both are extremely expensive and inefficient for massive data sets. The need to find best-fit arguments in semi-automatic clustering is not the only concern, the manipulation of data to find the arguments opposes the philosophy of "let the data speak for themselves" that underpins exploratory data analysis. Our new approach consists of effective and efficient methods for discovering cluster boundaries in point-data sets. The approach automatically extracts boundaries based on Voronoi modelling and Delaunay Diagrams. Parameters are not specified by users in our automatic clustering. Rather, values for parameters are revealed from the proximity structures of the Voronoi modelling, and thus, an algorithm, AUTOCLUST, calculates them from the Delaunay Diagram. This not only removes human-generated bias, but also re...
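The edge-removal idea behind Delaunay-based automatic clustering can be sketched as follows; for brevity this uses a single global mean-plus-deviation cutoff, whereas AUTOCLUST derives per-vertex criteria from local edge statistics:

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(5)

# two well-separated groups of points in 2-D
pts = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                 rng.normal(5.0, 0.3, (20, 2))])

tri = Delaunay(pts)
edges = set()
for simplex in tri.simplices:
    for i in range(3):
        a, b = sorted((int(simplex[i]), int(simplex[(i + 1) % 3])))
        edges.add((a, b))
lengths = {e: float(np.linalg.norm(pts[e[0]] - pts[e[1]])) for e in edges}

# drop edges much longer than typical: these are the cluster boundaries
mu = np.mean(list(lengths.values()))
sd = np.std(list(lengths.values()))
keep = [e for e, length in lengths.items() if length <= mu + sd]

# connected components of the surviving edges are the clusters; no
# user-specified parameters were needed anywhere
parent = list(range(len(pts)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for a, b in keep:
    parent[find(a)] = find(b)
clusters = {find(i) for i in range(len(pts))}
```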
Article
The Voronoi diagram of a set of sites partitions space into regions, one per site; the region for a site s consists of all points closer to s than to any other site. The dual of the Voronoi diagram, the Delaunay triangulation, is the unique triangulation such that the circumsphere of every triangle contains no sites in its interior. Voronoi diagrams and Delaunay triangulations have been rediscovered or applied in many areas of mathematics and the natural sciences; they are central topics in computational geometry, with hundreds of papers discussing algorithms and extensions. Section 2 discusses the definition and basic properties in the usual case of point sites in R^d with the Euclidean metric, while Section 3 gives basic algorithms. Some of the many extensions obtained by varying metric, sites, environment, and constraints are discussed in Section 4. Section 5 finishes with some interesting and nonobvious
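The empty-circumcircle property that defines the Delaunay triangulation can be checked numerically; a small sketch using SciPy's Qhull-based triangulation and the standard incircle determinant test:

```python
import numpy as np
from scipy.spatial import Delaunay

def in_circumcircle(a, b, c, p):
    """True iff p lies strictly inside the circumcircle of triangle abc."""
    # orient the triangle counter-clockwise so the determinant sign test holds
    if (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]) < 0:
        b, c = c, b
    M = np.array([[a[0] - p[0], a[1] - p[1], (a[0] - p[0]) ** 2 + (a[1] - p[1]) ** 2],
                  [b[0] - p[0], b[1] - p[1], (b[0] - p[0]) ** 2 + (b[1] - p[1]) ** 2],
                  [c[0] - p[0], c[1] - p[1], (c[0] - p[0]) ** 2 + (c[1] - p[1]) ** 2]])
    return np.linalg.det(M) > 1e-9

pts = np.random.default_rng(6).random((12, 2))
tri = Delaunay(pts)

# defining property: no site lies inside any triangle's circumcircle
empty = all(not in_circumcircle(*pts[s], pts[i])
            for s in tri.simplices
            for i in range(len(pts)) if i not in s)
```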