Conference PaperPDF Available

# A Cascaded Approach for Keyframes Extraction from Videos

Authors:

## Abstract and Figures

Keyframes extraction, a fundamental problem in video processing and analysis, has remained a challenge to date. In this paper, we introduce a novel method to effectively extract the keyframes of a video. It consists of four steps. First, we generate initial clips for the classified frames, based on consistent content within a clip. Using empirical evidence, we design an adaptive window length for the frame difference processing, which then outputs the initial keyframes. We further remove the frames with meaningless information (e.g., black screen) from the initial clips and initial keyframes. To achieve satisfactory keyframes, we finally map the current keyframes to the space of current clips and optimize the keyframes based on similarity. Extensive experiments show that our method outperforms state-of-the-art keyframe extraction techniques, with an average of $96.84\%$ on precision and $81.55\%$ on $F_1$.
Yunhua Pei$^{1}$, Zhiyi Huang$^{1}$, Wenjie Yu$^{1}$, Meili Wang$^{1,2,3}$, and Xuequan Lu$^{4}$
$^{1}$ College of Information Engineering, Northwest A&F University, China.
$^{2}$ Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture and Rural Affairs, Yangling, Shaanxi 712100, China.
$^{3}$ Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling 712100, China.
wml@nwsuaf.edu.cn
https://cie.nwsuaf.edu.cn/szdw/fjs/2012110003/
$^{4}$ Deakin University, 221 Burwood Highway, Burwood, Victoria 3125, Australia
xuequan.lu@deakin.edu.au
http://www.xuequanlu.com/
Keywords: keyframe extraction · frame difference · image classification · video retrieval.
This is a preprint.
1 Introduction
Keyframes extraction, that is, extracting keyframes from a video, is a fundamental problem in video processing and analysis. It has many applications, such as video coding, so it is important to design robust and effective keyframe extraction methods. Current methods are usually based on either the pixel matrix or deep learning classification results [5, 6, 9, 10, 18, 19]. However, they still suffer from some limitations. More specifically, the keyframe extraction techniques based on the pixel matrix are not capable of achieving decent accuracies, for example, when handling news videos [16]. Learning-based keyframe extraction, in turn, could take a considerable amount of time [15].
Motivated by the above issues, we propose a novel keyframe extraction approach in this paper. Given an input video, we first turn it into frames and perform classification with available deep learning networks. The classified frames are split into initial clips, each of which has consistent content. We then design an adaptive window length for frame difference processing, which takes the computed initial clips as input and outputs the initial keyframes. We also remove the frames with meaningless information, such as black screens, from the previous results. Finally, to obtain the desired keyframes, we map the current keyframes to the space of the current clips and optimize the keyframes based on similarity.
Our method is simple yet effective. It is elegantly built on top of deep learning classification and frame difference processing. Experiments validate our approach and demonstrate that it outperforms or is comparable to state-of-the-art keyframe extraction techniques. The main contributions of this paper are:
- a novel robust keyframe extraction approach that fits various types of videos;
- the design of the adaptive window length and the removal of meaningless frames;
- a mapping scheme and an optimization method for determining keyframes.
Our source code will be released online.
2 Related Work
Keyframes extraction has been studied extensively. We only review the research most relevant to our work. Please refer to [2, 12] for a comprehensive review.
Some researchers introduced a keyframe extraction method for human motion capture data by exploiting the sparseness and Riemannian manifold structure of human motion [17]. Guan et al. introduced two criteria, coverage and redundancy, based on keypoint matching, to solve the keyframe selection problem [3]. Kuanar et al. obtained keyframes with an iterative edge pruning strategy, using a dynamic Delaunay diagram for clustering [5]. Mehmood et al. used both visual attention and aural attention to perform the extraction [11].
Some researchers extract keyframes based on machine learning results. Yang et al. used an unsupervised clustering algorithm to first divide the frames, and then selected keyframes from the clustering candidates [18]. Yong et al. extracted keyframes through image segmentation, feature extraction and matching of image blocks, and the construction of a co-occurrence matrix of semantic labels [19]. Li et al. mapped the video data to a high-dimensional space and learnt a new representation that could reflect the representativeness of a frame [7].
Although existing research has shown that motion capture and machine learning are effective ways to extract high-quality keyframes, there is still no single algorithm that can extract keyframes from these two perspectives at the same time and suit every kind of video.
3 Method
3.1 Overview
Our keyframe extraction approach consists of four steps:
1. Initial clips generation. We first obtain the classification results and split the classified frames into clips, each of which involves consistent content.
2. Adaptive window length for frame difference processing. We then design an adaptive window length for the frame difference method, which takes the output of Step 1 as input and outputs the initial keyframes.
3. Meaningless frames removal. We refine the results of Steps 1 and 2 by removing the frames with meaningless information (e.g., black screen).
4. Mapping and optimization. After removing meaningless frames, we finally map the current keyframes to the space of the current clips and perform keyframe optimization based on similarity, to achieve more representative keyframes.
3.2 Generating Initial Clips
The ImageAI library [1] provides four different deep learning networks (Resnet50, DenseNet-BC-121-32, Inception v3, Squeezenet), separately trained on the ImageNet-1000 dataset. The four networks in this paper use the default parameter settings; please refer to [1] for more details. The classification softmax is defined as
$$\mathrm{softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^{N} e^{s_j}} \quad (i = 1, \ldots, N) \tag{1}$$
where $s_i$ represents the score of the input $x$ on the $i$th category. After the calculation, we take the category with the maximum recognition probability as the label of an image.
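As a minimal sketch of Eq. (1) and the labeling rule, the following Python code turns raw class scores into probabilities and picks the most probable category. The function names `softmax` and `predict_label`, and the example class names, are hypothetical and are not part of ImageAI; subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of raw class scores (Eq. 1)."""
    # Subtracting the maximum keeps exp() from overflowing; the ratios are unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores, class_names):
    """Return (label, probability) of the most probable category."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical scores for three categories.
label, prob = predict_label([1.0, 3.0, 0.5], ["cat", "dog", "car"])
```

Here `label` is the category with the largest score, since softmax preserves the ordering of the scores.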
We turn the videos into consecutive frames, which act as the input to the networks. Since one object in different scenes carries different meanings, it is necessary to treat them as independent scenes. Based on the image classification results, we split each video into a few clips, each of which continuously represents certain content, by simply checking whether the labels of the current and next frames are the same.
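The splitting rule can be sketched as follows, assuming the per-frame labels are already available as a list. The helper name `split_into_clips` and the tuple layout of its output are illustrative choices, not the authors' implementation.

```python
from itertools import groupby

def split_into_clips(labels):
    """Group consecutive frames with identical labels into clips.

    Returns a list of (label, start_index, end_index) tuples; end_index is inclusive.
    """
    clips = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        clips.append((label, start, start + length - 1))
        start += length
    return clips

# The same label in two separate runs yields two independent clips.
clips = split_into_clips(["car", "car", "dog", "car"])
```

Note that a label reappearing later starts a new clip, which matches the idea of treating the same object in different scenes independently.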
3.3 Adaptive Window Length
The original frame difference method [13] simply sets the window length to 1 or a fixed value, which is prone to generating undesired keyframes. To solve this issue, we introduce a formula that enables an adaptive length ($L$) calculation:
$$L = \frac{P(\mathrm{frame})}{\mathrm{Expvalue}} \tag{2}$$
where $P(\mathrm{frame})$ is the total number of frames in the video, and $\mathrm{Expvalue}$ denotes the expected number of keyframes.
We will describe how we achieve an adaptive threshold in the experiments (Sec. 4.1), based on the videos.
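The following sketch illustrates how an adaptive window length could drive frame difference processing. Several details are assumptions made only for illustration, since the text does not specify them: frames are flat lists of grayscale intensities, the difference measure is the mean absolute intensity difference, the window length from Eq. (2) is rounded down (and kept at least 1), and one keyframe is taken per window as the frame that differs most from its predecessor. All function names are hypothetical.

```python
def adaptive_window_length(total_frames, expected_keyframes):
    """Eq. (2): L = P(frame) / Expvalue, rounded down and kept at least 1 (assumption)."""
    return max(1, total_frames // expected_keyframes)

def frame_difference(a, b):
    """Mean absolute intensity difference between two equally sized frames (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def initial_keyframes(frames, expected_keyframes):
    """Return, for each window of length L, the index of the frame that differs most from its predecessor."""
    n = len(frames)
    if n == 0:
        return []
    window_len = adaptive_window_length(n, expected_keyframes)
    # diffs[i] is the difference between frame i and frame i-1 (0 for the first frame).
    diffs = [0.0] + [frame_difference(frames[i - 1], frames[i]) for i in range(1, n)]
    keyframes = []
    for start in range(0, n, window_len):
        window = range(start, min(start + window_len, n))
        keyframes.append(max(window, key=diffs.__getitem__))
    return keyframes

# Six tiny one-pixel "frames" and two expected keyframes give a window length of 3.
picked = initial_keyframes([[0], [0], [10], [10], [10], [50]], expected_keyframes=2)
```

With a fixed window length of 1 every frame would be a candidate; tying the length to the expected number of keyframes is what keeps the number of initial keyframes proportional to the video length.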
3.4 Meaningless Frames Removal
Some consecutive frames may deliver little information, for example, black, white, or other fuzzy colors representing non-recognizable items. As such, it is necessary to refine the results of Sec. 3.2 and Sec. 3.3. We judge whether a frame image carries information by the following definition, where $\mathit{mount}$ is the total number of different colors in a frame. The $\mathit{mount}$ threshold (i.e., $T$) is empirically set to 600 after testing on a hundred randomly selected pure-color images of 640×480 pixels from the Internet and the experiment videos.
$$P = \begin{cases} 1, & \mathit{mount} > T \\ 0, & \text{others} \end{cases} \tag{3}$$
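A minimal sketch of this test is given below, assuming a frame is available as a sequence of (R, G, B) tuples and that a frame counts as informative when its number of distinct colors exceeds the threshold. The direction of the comparison and the function names are assumptions made for illustration.

```python
def count_colors(pixels):
    """Number of distinct colors in a frame given as an iterable of (R, G, B) tuples."""
    return len(set(pixels))

def has_information(pixels, threshold=600):
    """Indicator in the spirit of Eq. (3): 1 if the color count exceeds T, else 0 (assumed direction)."""
    return 1 if count_colors(pixels) > threshold else 0

def remove_meaningless(frames, threshold=600):
    """Keep only the frames whose indicator is 1."""
    return [f for f in frames if has_information(f, threshold)]

# A pure black frame has a single color; the synthetic frame below has 1000 distinct colors.
black = [(0, 0, 0)] * 1000
colorful = [(i % 256, i // 256, 0) for i in range(1000)]
kept = remove_meaningless([black, colorful])
```

Counting distinct colors is cheap compared with running a classifier, which makes it a convenient filter for black screens and similar frames.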
3.5 Mapping and Optimization
To obtain satisfactory results and reduce redundancy, we propose to map the results of Sec. 3.3 onto the space of the results of Sec. 3.2. After mapping, we perform the first optimization by computing the average similarity of each frame in a clip, and the frame with the highest average similarity is set as the keyframe of this clip. Finally, if the keyframes of two consecutive clips generated in Sec. 3.2 have a similarity above 50%, we conduct a second optimization by simply keeping the first keyframe and discarding the second. This is because actual videos sometimes switch camera shots quickly, which leads to repeated or similar scenes.
Similar to [4], the similarity is formulated as:
$$\mathrm{sim}(G, S) = \frac{1}{M} \sum_{i=1}^{M} \left(1 - \frac{|g_i - s_i|}{\mathrm{Max}(g_i, s_i)}\right) \tag{4}$$
where $G$ and $S$ are the histogram values obtained after transforming the two images into regular images, respectively, with $g_i$ and $s_i$ denoting their $i$th entries. $M$ is the total number of samples in the color space. See [14] for more information. The expected keyframe is the one with the greatest similarity, that is, $\arg\max_k \frac{1}{|C|-1} \sum \mathrm{sim}(G_k, S)$, where $(G_k, S)$ and $|C|$ are a pair of two frames (not identical) and the number of frames in the involved clip $C$, respectively. It is often not necessary to compare the similarity between frames of different clips due to their discontinuity and dissimilarity.
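The similarity measure and the two optimization passes can be sketched as follows, assuming each frame is already represented by a color histogram of equal length. The function names are hypothetical, and two choices are assumptions: bins that are empty in both histograms contribute a term of 1, and the second pass compares each keyframe with the most recently kept one.

```python
def sim(g, s):
    """Eq. (4): mean of 1 - |g_i - s_i| / max(g_i, s_i) over the M histogram bins."""
    total = 0.0
    for gi, si in zip(g, s):
        m = max(gi, si)
        # Assumption: a bin that is empty in both histograms counts as identical.
        total += 1.0 if m == 0 else 1.0 - abs(gi - si) / m
    return total / len(g)

def clip_keyframe(histograms):
    """Index of the frame with the highest average similarity to the other frames of the clip."""
    n = len(histograms)
    if n == 1:
        return 0

    def avg_sim(k):
        others = (sim(histograms[k], histograms[j]) for j in range(n) if j != k)
        return sum(others) / (n - 1)

    return max(range(n), key=avg_sim)

def drop_similar_neighbors(keyframes, threshold=0.5):
    """Keep the first of two consecutive keyframes whose similarity is above the threshold."""
    kept = []
    for hist in keyframes:
        if kept and sim(kept[-1], hist) > threshold:
            continue
        kept.append(hist)
    return kept

# Three two-bin histograms in one clip: the middle frame is the most representative.
best = clip_keyframe([[10, 0], [9, 1], [0, 10]])
```

Restricting the average in `clip_keyframe` to frames of the same clip mirrors the remark that cross-clip comparisons are usually unnecessary.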
4 Experiments
4.1 Experimental Setup
Data. A dataset of ten videos is used to validate the proposed method.
Ground truth. Similar to previous research [7], three volunteers with multimedia expertise independently selected and merged the keyframes of each video. If some images deliver the same information, they are merged by manually selecting the most representative one and discarding the others. Table 1 illustrates an example. We perform this operation on all the videos, and the results are shown as MK in Table 2.
Table 1: Ground truth keyframes generation example. Rows: Volunteer A, Volunteer B, Volunteer C, Merged.
We display the numbers of initial keyframes and final keyframes based on the four classification networks in Table 2. It can be seen from Table 2 that, for each video, the number of final keyframes generally decreases, which indicates that Sec. 3.4 and Sec. 3.5 refine the initial keyframes. The runtime of our method is also shown in Table 2; it ranges from 0.66 to 6.79 times the length of the videos.
Table 2: Results and runtime by using different networks (in seconds). MK: merged keyframes. IK: initial keyframes. FK: final keyframes. RT: runtime. R: Resnet50, D: DenseNet-BC-121-32, I: Inception v3, S: Squeezenet. AVG: average.

| Video | MK | IK (R) | IK (D) | IK (I) | IK (S) | FK (R) | FK (D) | FK (I) | FK (S) | RT (R) | RT (D) | RT (I) | RT (S) | RT (AVG) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ads Audi(84s) | 22 | 29 | 31 | 17 | 21 | 21 | 22 | 17 | 18 | 368.1 | 442.9 | 295.5 | 208.0 | 328.6 |
| BBC...Camera(169s) | 21 | 62 | 42 | 32 | 35 | 28 | 21 | 22 | 22 | 485.9 | 621.0 | 380.0 | 143.1 | 407.5 |
| Highlight soccer(101s) | 30 | 31 | 41 | 28 | 28 | 25 | 33 | 23 | 23 | 598.3 | 685.5 | 533.2 | 398.2 | 553.8 |
| lion vs zebra judo(18s) | 3 | 64 | 55 | 64 | 100 | 3 | 2 | 2 | 2 | 88.7 | 104.2 | 81.4 | 37.9 | 78.0 |
| MV Gnstyle(252s) | 41 | 36 | 46 | 30 | 30 | 31 | 40 | 26 | 26 | 426.0 | 518.8 | 350.0 | 166.7 | 365.4 |
| Trailer...nonesub(161s) | 46 | 37 | 23 | 30 | 26 | 18 | 15 | 18 | 16 | 284.5 | 315.8 | 225.5 | 115.3 | 235.3 |
| Trailer...sub(147s) | 45 | 47 | 63 | 54 | 63 | 31 | 38 | 34 | 40 | 606.9 | 652.3 | 490.7 | 392.9 | 535.7 |
| Trailer Pokemon(196s) | 72 | 85 | 76 | 65 | 75 | 66 | 63 | 56 | 66 | 622.8 | 770.6 | 498.3 | 287.5 | 544.8 |
| UGS10 003(77s) | 14 | 33 | 27 | 34 | 35 | 12 | 11 | 11 | 12 | 366.9 | 426.0 | 293.2 | 166.3 | 313.1 |
| Dragons Fight(87s) | 10 | 27 | 25 | 27 | 27 | 6 | 7 | 9 | 8 | 270.1 | 316.7 | 234.8 | 128.1 | 237.4 |
Experimental setting. Our framework is implemented on a Lenovo Y7000 laptop with an Intel(R) Core(TM) i5-9300H64 2.3GHz CPU and an NVIDIA GeForce GTX 1050 graphics card.
Window length. We take the clips separated by ResNet50 as an example and remove the short clips whose lengths fall below a certain threshold. We found that, with this threshold set to 10, the remaining clips cover 90% of the frames of the whole video.
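The coverage check described here can be sketched as a short computation, assuming the threshold is a minimum clip length in frames and that clips shorter than it are dropped. The function name and the example clip lengths are hypothetical.

```python
def coverage_after_pruning(clip_lengths, min_length=10):
    """Fraction of all frames that remain after dropping clips shorter than min_length."""
    total = sum(clip_lengths)
    if total == 0:
        return 0.0
    kept = sum(n for n in clip_lengths if n >= min_length)
    return kept / total

# Hypothetical clip lengths: the two short clips hold 10 of the 100 frames.
ratio = coverage_after_pruning([50, 3, 40, 7], min_length=10)
```

In this made-up example, 90 of the 100 frames survive the pruning, so the ratio is 0.9.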
4.2 Quantitative Results
As with previous works [7, 8, 11], the precision $P$, the recall $R$, and their harmonic mean ($F_1$) are employed as the evaluation metrics. They are computed as:
$$P = \frac{N_c}{N_c + N_f} \times 100\% \tag{5}$$
$$R = \frac{N_c}{N_c + N_m} \times 100\% \tag{6}$$
$$F_1 = \frac{2RP}{R + P} \times 100\% \tag{7}$$
where $N_c$ denotes the number of correctly extracted keyframes, $N_f$ refers to the number of incorrect keyframes, and $N_m$ is the number of missing keyframes. $F_1$ is a combined measure of $R$ and $P$, and a higher value indicates both higher $R$ and $P$.
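As a small worked example of Eqs. (5) to (7), the functions below compute the three metrics from keyframe counts. The function names and the counts are hypothetical; precision and recall are returned as percentages, so the $F_1$ function takes percentages and returns a percentage without a further factor of 100.

```python
def precision(n_correct, n_false):
    """Eq. (5): percentage of extracted keyframes that are correct."""
    denom = n_correct + n_false
    return 100.0 * n_correct / denom if denom else 0.0

def recall(n_correct, n_missed):
    """Eq. (6): percentage of ground-truth keyframes that were extracted."""
    denom = n_correct + n_missed
    return 100.0 * n_correct / denom if denom else 0.0

def f1_score(p, r):
    """Eq. (7): harmonic mean of precision and recall, both given in percent."""
    return 2.0 * r * p / (r + p) if (r + p) else 0.0

# Hypothetical counts: 18 correct, 2 incorrect, 4 missed keyframes.
p = precision(18, 2)   # 90.0
r = recall(18, 4)      # about 81.8
f1 = f1_score(p, r)    # about 85.7
```

As a consistency check, a precision of 92.8% and a recall of 80.2% give an $F_1$ of about 86.0%.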
Fig. 1 shows the evaluation numbers for each video using the four different networks. We can observe that the accuracies of our method are high, except for a few outliers (e.g., the R network for V2 and V5). The average precision is 96.84%. Fig. 1(b) gives the recall numbers, which are lower than the precision numbers. We suspect that there are two reasons. First, the ground truth set of a video is obtained by simply removing high-similarity frames. Second, some blurry frames can result in misclassifications and thus lower numbers of keyframes. As a result, $N_c + N_m$ becomes large and $N_c$ becomes small, leading to relatively low recall numbers. Fig. 1(c) reflects that the overall performance is generally good, with an average $F_1$ of 81.55% for all videos and 84.69% for the videos excluding the outlier video V7.
Table 3: Comparison with [7] and [8]. Rows: [7], [8], R, D, I, S.
Fig. 1: Evaluation results. (a) Precision rate. (b) Recall rate. (c) F1 index. D: DenseNet-BC-121-32, I: Inception v3, R: Resnet50, S: Squeezenet.
Table 4: Comparison with [11]. Rows: [11], R, D, I, S.
Besides the above experiments that validate our approach, we also compare our method with state-of-the-art keyframe extraction techniques [7, 8, 11]. Tables 3 and 4 show some visual comparisons between our method and [7, 8, 11]. It can be seen from Table 3 that our extracted keyframes are very similar to those of current techniques [7, 8]. Furthermore, our method can extract more representative keyframes, in terms of significant distinctions and front views. Table 4 shows that our method can extract fewer keyframes than [11] to describe the key content of the video, while the results of [11] seem a bit redundant in terms of scene keyframes. Their scene keyframes account for 36.4%, while ours account for only 16.1%. This video actually concentrates more on humans than on pure scenes.
In addition to visual comparisons, we also conduct quantitative comparisons using the metrics mentioned above. The average $P$, $R$ and $F_1$ numbers are listed in Table 5. Our average $P$ numbers based on the four networks are the highest among all methods. Note that the numbers inside brackets are computed by excluding the outlier video V7.
Table 5: Metrics comparison by using different methods.

| Method | P | R | F1 |
|---|---|---|---|
| [7] | 87.5% | 84.0% | 85.7% |
| [8] | 92.0% | 87.8% | 89.9% |
| [11] | 90.0% | 80.0% | 84.7% |
| R | 92.8% (93.3%) | 80.2% (79.3%) | 86.0% (82.5%) |
| D | 93.9% (94.3%) | 80.2% (85.3%) | 86.5% (89.2%) |
| I | 93.5% (93.8%) | 76.1% (74.3%) | 83.9% (82.3%) |
| S | 95.7% (95.7%) | 83.2% (80.4%) | 89.0% (87.0%) |
5 Conclusion
We have proposed a novel framework for extracting keyframes from videos. Various experiments demonstrate that our approach is effective, and better than or comparable to state-of-the-art methods.
One limitation is that it is challenging for existing deep learning networks to classify blurred images, which leads to undesired keyframes for videos with frequent blur. In future work, we would like to investigate and address this limitation, for example, by incorporating deblurring techniques into our framework.
Acknowledgement
This work was partially funded by the Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture and Rural Affairs, Yangling, Shaanxi 712100, China (2018AI OT 09), the National Natural Science Foundation of China (61702433), and the Key Research and Development Program of Shaanxi Province (2018NY 127).
References
1. open source python library built to empower developers to build applications and systems with self-contained computer vision capabilities, https://github.com/OlafenwaMoses/frameAI
2. Asghar, M.N., Hussain, F., Manton, R.: Video indexing: a survey. International Journal of Computer and Information Technology 3(01) (2014)
3. Guan, G., Wang, Z., Lu, S., Deng, J.D., Feng, D.D.: Keypoint-based keyframe selection. IEEE Transactions on Circuits and Systems for Video Technology 23(4), 729–734 (April 2013). https://doi.org/10.1109/TCSVT.2012.2214871
4. Jiang, L., Shen, G., Zhang, G.: An image retrieval algorithm based on hsv color segment histograms. Mechanical & Electrical Engineering Magazine 26(11), 54–57 (2009)
5. Kuanar, S.K., Panda, R., Chowdhury, A.S.: Video key frame extraction through dynamic delaunay clustering with a structural constraint. Journal of Visual Communication and Image Representation 24(7), 1212–1227 (2013)
6. Kulhare, S., Sah, S., Pillai, S., Ptucha, R.: Key frame extraction for salient activity recognition. In: 2016 23rd International Conference on Pattern Recognition (ICPR). pp. 835–840. IEEE (2016)
7. Li, X., Zhao, B., Lu, X.: Key frame extraction in the summary space. IEEE transactions on cybernetics 48(6), 1923–1934 (2017)
8. Liu, H., Li, T.: Key frame extraction based on improved frame blocks features and second extraction. In: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD). pp. 1950–1955. IEEE (2015)
9. Liu, H., Meng, W., Liu, Z.: Key frame extraction of online video based on optimized frame difference. In: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery. pp. 1238–1242. IEEE (2012)
10. Luo, Y., Zhou, H., Tan, Q., Chen, X., Yun, M.: Key frame extraction of surveillance video based on moving object detection and image similarity. Pattern Recognition and Image Analysis 28(2), 225–231 (2018)
11. Mehmood, I., Sajjad, M., Rho, S., Baik, S.W.: Divide-and-conquer based summarization framework for extracting affective video content. Neurocomputing 174, 393–403 (2016)
12. Milan Kumar Asha Paul, J.K., Rani, P.A.J.: Key-frame extraction techniques: A review. Recent Patents on Computer Science 11(1), 3–16 (2018). https://doi.org/10.2174/2213275911666180719111118
13. Singla, N.: Motion detection based on frame difference method. International Journal of Information & Computation Technology 4(15), 1559–1565 (2014)
14. Swain, M.J., Ballard, D.H.: Indexing via color histograms. In: Active perception and robot vision, pp. 261–273. Springer (1992)
15. Tang, H., Zhou, J.: Method for extracting the key frame of various types video based on machine learning. Industrial Control Computer 3, 94–95 (2014)
16. Wang, S., Han, Y., Yadong, W.U., Zhang, S.: Video key frame extraction method based on image dominant color. Journal of Computer Applications 33(9), 2631–2635 (2013)
17. Xia, G., Sun, H., Niu, X., Zhang, G., Feng, L.: Keyframe extraction for human motion capture data based on joint kernel sparse representation. IEEE Transactions on Industrial Electronics 64(2), 1589–1599 (2016)
18. Yang, S., Lin, X.: Key frame extraction using unsupervised clustering based on a statistical model. Tsinghua Science & Technology 10(2), 169–173 (2005)
19. Yong, S.P., Deng, J.D., Purvis, M.K.: Wildlife video key-frame extraction based on novelty detection in semantic context. Multimedia Tools and Applications 62(2), 359–376 (2013)