Figure 2
Source publication
This paper presents an approach to affective video summarisation based on the facial expressions (FX) of viewers. A facial expression recognition system was deployed to capture a viewer's face and his/her expressions. The user's facial expressions were analysed to infer personalised affective scenes from videos. We proposed two models, pronounced...
Context in source publication
Context 1
... should be noted that all video clips were new to the participants. The content video and the recording of facial expressions were synchronised for subsequent analysis (see Figure 2). The FX (facial expression) videos were exported to 360x240 pixels AVI format with 25 frames per second (same as the video clips). ...
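The excerpt above describes frame-level synchronisation of the content video and the FX recording at a shared 25 fps rate. As a rough illustration, the sketch below (Python with OpenCV; the file names content.avi and fx.avi are hypothetical) steps through both recordings frame by frame on a common timeline.

# Minimal sketch of frame-synchronised traversal of the content video and the
# FX (facial expression) recording. File names are hypothetical; both clips
# are assumed to share the 25 fps rate reported above.
import cv2

FPS = 25.0
content = cv2.VideoCapture("content.avi")
fx = cv2.VideoCapture("fx.avi")

frame_idx = 0
while True:
    ok_c, content_frame = content.read()
    ok_f, fx_frame = fx.read()
    if not (ok_c and ok_f):
        break  # stop at the end of the shorter recording
    timestamp = frame_idx / FPS  # shared timeline in seconds
    # ... analyse fx_frame (e.g. run the facial expression recogniser) and
    # associate the result with `timestamp` in the content video ...
    frame_idx += 1

content.release()
fx.release()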
Similar publications
The automatic analysis of emotion remains a challenging task in unconstrained experimental conditions. In this paper, we present our contribution to the 6th Audio/Visual Emotion Challenge (AVEC 2016), which aims at predicting the continuous emotional dimensions of arousal and valence. First, we propose to improve the performance of the multimodal p...
High-level manipulation of facial expressions in images, such as expression synthesis, is challenging because facial expression changes are highly non-linear and vary depending on the facial appearance. The identity of the person should also be well preserved in the synthesized face. In this paper, we propose a novel U-Net Conditioned Generative Adversa...
In the current study, we explored the role of facial expression, clothing color and stereotype conformity in likability and competence attribution and corporate collaborator choice.
This work proposes a framework for facial expression recognition based on generalized Procrustes analysis. The proposed system classifies seven different facial expressions: happiness, anger, sadness, surprise, disgust, fear and neutral. The proposed system was evaluated with the MUG Facial Expression database. Experimental results show that the p...
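The abstract above names generalized Procrustes analysis (GPA) as the basis of the system. The following Python/NumPy sketch shows the ordinary Procrustes alignment step that GPA iterates against an evolving mean shape; the landmark coordinates are made up for the example and the reflection case is ignored.

# Sketch of the ordinary Procrustes alignment step that generalized Procrustes
# analysis (GPA) repeats against an evolving mean shape. Landmark arrays are
# hypothetical (N x 2 facial landmark coordinates); reflection handling omitted.
import numpy as np

def procrustes_align(shape, reference):
    """Align `shape` to `reference` by translation, uniform scale and rotation."""
    # Remove translation: centre both landmark sets on the origin.
    shape_c = shape - shape.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # Remove scale: normalise to unit Frobenius norm.
    shape_c /= np.linalg.norm(shape_c)
    ref_c /= np.linalg.norm(ref_c)
    # Optimal rotation via SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(shape_c.T @ ref_c)
    rotation = u @ vt
    return shape_c @ rotation

# Example: align a shifted, scaled, noisy landmark set to a reference shape.
reference = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
noisy = reference * 2.0 + 0.3 + np.random.normal(scale=0.01, size=reference.shape)
aligned = procrustes_align(noisy, reference)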
Citations
... • Affect: The affectiveness of the video can be measured by considering the emotional impact of the video on viewers. These approaches use human gaze [13] and viewer behavior [14] to extract affective segments of the video [76]. Some of the approaches also utilize psychological factors to model the affective contents of the video. ...
Since the last decade, the diverse applications of video summarization have gained increased attention, motivating researchers in the domain of computer vision to generate optimal and comprehensible video summaries. The main challenge in video summarization research is user perception and preference, as humans are the ultimate consumers of the generated summary. A single video summary cannot satisfy all users unless the summarization algorithm interacts with end users and adapts to their requirements. Conventional video summarization cannot tackle these user requirements. This study explores various state-of-the-art techniques developed for generating user-intended video summaries, focusing on query-attentive video summarization. Query-attentive video summarization is a multi-modal summarization method that generates a video summary satisfying the viewer's requirements by taking input queries from the viewers. This paper discusses the fundamental aspects of query-attentive video summarization, tracing its progress and evolution over time. Contemporary approaches are explored in detail, highlighting developed techniques with their advantages and limitations. Additionally, the article also studies publicly available datasets, including the extensively utilized Query-Focused Video Summarization dataset, since these datasets ensure the validity and applicability of developed techniques. Evaluation metrics, which are essential tools for measuring performance and assessing user satisfaction, are also studied, and performance comparisons are presented. After investigating the domain of query-attentive video summarization, this article addresses the current research challenges and identifies potential future research objectives. This comprehensive review offers a complete guide for new researchers in the field of query-attentive video summarization, covering both existing and future real-time applications.
... The query can be expressed as a textual description of some object or event, an image, video segments or keywords, and the summary is generated based on the query. In non-interactive or perception-based video summarization, instead of using an additional input such as a query, importance score generation is based on cues such as attention, emotions, facial expressions, etc. [10][11][12][13]. Video summarization techniques follow either an unsupervised or a supervised approach. ...
Video summarization extracts the relevant contents from a video and presents the entire content of the video in a compact and summarized form. User-based video summarization can summarize a video as per the requirements of the user. In this work, a non-interactive, perception-based video summarization technique is proposed that makes use of an attention mechanism to capture the user's interest and extract relevant keyshots in temporal sequence from the video content. Here, video summarization has been articulated as a sequence-to-sequence learning problem and a supervised method has been proposed for summarization of the video. Adding layers to the existing network makes it deeper, enables a higher level of abstraction and facilitates better feature extraction. Therefore, the proposed model uses a multi-layered, deep summarization encoder-decoder network (MLAVS) with an attention mechanism to select final keyshots from the video. The contextual information of the video frames is encoded using a multi-layered Bidirectional Long Short-Term Memory (BiLSTM) network as the encoder. For decoding, a multi-layered attention-based Long Short-Term Memory (LSTM) network using a multiplicative score function is employed. The experiments are performed on the benchmark TVSum dataset and the results obtained are compared with recent works. The results show considerable improvement and clearly demonstrate the efficacy of this methodology against most of the other available state-of-the-art methods.
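The abstract above outlines a BiLSTM encoder followed by an attention-based LSTM decoder with a multiplicative score function. The Python (PyTorch) sketch below illustrates that general structure; the layer sizes, feature dimension and per-frame importance head are illustrative assumptions, not the MLAVS configuration.

# Minimal sketch of a BiLSTM encoder with a multiplicative-attention LSTM
# decoder producing per-frame importance scores, assuming PyTorch.
import torch
import torch.nn as nn

class MultiplicativeAttention(nn.Module):
    """Luong-style (multiplicative) score: score(h_dec, h_enc) = h_dec^T W h_enc."""
    def __init__(self, dec_dim, enc_dim):
        super().__init__()
        self.w = nn.Linear(enc_dim, dec_dim, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, time, enc_dim)
        scores = torch.bmm(self.w(enc_outputs), dec_state.unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)                   # (batch, time)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights

class SummarizerSketch(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, layers=2):
        super().__init__()
        # Multi-layered BiLSTM encoder over per-frame features.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=layers,
                               bidirectional=True, batch_first=True)
        self.attention = MultiplicativeAttention(hidden, 2 * hidden)
        self.decoder = nn.LSTMCell(2 * hidden, hidden)
        self.score = nn.Linear(hidden, 1)   # per-step frame-importance score

    def forward(self, frame_feats):
        enc_out, _ = self.encoder(frame_feats)                   # (batch, time, 2*hidden)
        batch, time, _ = enc_out.shape
        h = frame_feats.new_zeros(batch, self.decoder.hidden_size)
        c = frame_feats.new_zeros(batch, self.decoder.hidden_size)
        importances = []
        for _ in range(time):
            context, _ = self.attention(h, enc_out)
            h, c = self.decoder(context, (h, c))
            importances.append(torch.sigmoid(self.score(h)))
        return torch.cat(importances, dim=1)                     # (batch, time)

# Usage: scores = SummarizerSketch()(torch.randn(2, 30, 1024))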
... Observations: The performance of user behavior-based methods is mainly impacted by the accuracy with which user behavioral information is acquired. This is especially true for single-modality approaches (Joho and Jose 2009; Katti et al. 2011; Money and Agius 2010; Paul and Musfequs Salehin 2019; Peng et al. 2011; Qayyum et al. 2019). The multi-modal approaches (Han et al. 2014; Mehmood et al. 2016; Wu et al. 2018; Xu et al. 2015) result in robust summaries. ...
... al. 2011; Money and Agius 2010), psychological (EEG/fMRI, facial expression) (Joho and Jose 2009; Qayyum et al. 2019) or behavioral factors (gaze information, head movement) (Han et al. 2014; Mehmood et al. 2016; Paul and Musfequs Salehin 2019; Peng et al. 2011; Wu et al. 2018; Xu et al. 2015) for detection of important and user-relevant video key-frames/key-shots. Under this category, some approaches are solely based on reaction features (Katti et al. 2011; Money and Agius 2010; Joho and Jose 2009; Qayyum et al. 2019; Peng et al. 2011; Xu et al. 2015) while others adopted multimodal techniques where video-internal components (visual, audio, text) are leveraged along with external viewer-specific information (Mehmood et al. 2016; Paul and Musfequs Salehin 2019; Wu et al. 2018; Han et al. 2014). These approaches can be broadly classified into three categories based on methodology: conventional (non-learning), machine learning and deep learning. Conventional methods: Joho and Jose (2009) used facial activities, in particular the prominence and frequency of the viewer's expression, for shot importance estimation. ...
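The context above mentions shot importance estimated from the prominence and frequency of the viewer's expressions. The toy Python sketch below illustrates how such a score could combine an expression change rate with an average prominence value; the prominence table and the weights are invented for the example and are not the authors' actual formulation.

# Toy illustration of scoring a shot from a per-frame expression sequence using
# the two cues mentioned above: how often the expression changes (frequency)
# and how far it departs from neutral (prominence). The prominence values and
# weights are invented for this example.
PROMINENCE = {"neutral": 0.0, "happiness": 0.6, "surprise": 0.8,
              "anger": 0.9, "sadness": 0.7, "fear": 0.9, "disgust": 0.8}

def shot_importance(frame_labels, w_change=0.5, w_prominence=0.5):
    if not frame_labels:
        return 0.0
    changes = sum(1 for a, b in zip(frame_labels, frame_labels[1:]) if a != b)
    change_rate = changes / max(len(frame_labels) - 1, 1)
    prominence = sum(PROMINENCE.get(l, 0.0) for l in frame_labels) / len(frame_labels)
    return w_change * change_rate + w_prominence * prominence

# Example: a shot where the viewer reacts scores higher than a neutral one.
print(shot_importance(["neutral", "happiness", "happiness", "surprise"]))
print(shot_importance(["neutral"] * 4))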
Video summarization deals with the generation of a condensed version of the original video by including meaningful frames or segments while eliminating redundant information. The main challenge in a video summarization task is to identify important frames or segments corresponding to human perception, which varies from one genre to another. In the past two decades, several summarization techniques, ranging from conventional non-learning to deep learning-based mechanisms, have been developed. This study provides a comprehensive survey of the massive literature, with a scope ranging from general to domain-specific methods, single-view to multi-view processes, generic to user-interaction-based mechanisms and conventional to deep learning-based approaches. The presented work provides a general pipeline and a broad classification of video summarization systems. The survey also presents genre-wise dataset descriptions, various evaluation techniques and future recommendations. The key points of the presented work lie in its systematic analysis of the literature and its wide coverage, including domains that have been overlooked over time, such as aerial videos, medical videos and user-customization-based approaches. The research work in each category is investigated, compared and analyzed on the basis of various intrinsic characteristics. The main objective of this manuscript is to guide future researchers about state-of-the-art work done in various domains of the video summarization field, so that the scope and performance of automatic video summarization systems can be enhanced further by designing new approaches or by improving different existing techniques.
... The methods in Peng et al. (2010, 2011) present a psychometric model to summarize a video via psychological states and human actions. Joho et al. (2009) discussed a facial expression-based method that utilized motion units to determine 3D face motion from 2D points extracted from faces. Yoshitaka and Sawada (2012) and Xu et al. (2015) discussed approaches that produce a personalized summary using eye gaze information. ...
The exponential growth of technology has resulted in a profusion of advanced imaging devices and easier internet accessibility, leading to an increase in the creation and use of multimedia content. Analyzing representative or meaningful information from such massive data is a time-consuming task that impacts the efficiency of various video processing applications, including video searching, retrieval, indexing, sharing, and many more. In the literature, numerous video summarization techniques, which extract key-frames or key-shots from the original video to generate a concise yet informative summary, have been proposed to address these issues. This paper presents a discussion of state-of-the-art video summarization techniques along with their limitations and challenges. The paper examines summarization techniques in a holistic manner based upon the distinct attributes of evolving video data types, using parameters such as the number of views, dimensions, modality, and content. Such a categorization framework enables us to critically analyze the recent progress, future directions, limitations, datasets, application domains, etc., in a more comprehensible manner.
... video summarization [5], ii) security applications and business negotiation requiring lie detection for preventing fraud [6], and iii) monitoring suspicious intent for psychotherapy [7]. All these fields can benefit when the video frames containing MEs are correctly analyzed. ...
Micro-expression (ME) is required in real-world applications for understanding true human feeling. The preliminary step of ME analysis, ME spotting, is highly challenging for human experts because MEs induce subtle facial movements for a short duration. Moreover, the existing feature encodings are insufficient for spotting because they are affected by illumination and eye-blinking. These issues are alleviated for better ME spotting by our proposed method, PERSIST, that is, imProved fEatuRe encodingS and multIscale gauSsian Temporal convolutional network. It investigates the possibility of human gaze deformations for spotting. In contrast to the well-known sequence models like RNN and LSTM, it explores the feasibility of a temporal convolutional network to model long-term dependencies in a better way. Furthermore, the proposed network efficacy is significantly improved by adding a Gaussian filter layer and performing multi-resolution analysis. Experimental results conducted on publicly available ME spotting databases reveal that our method PERSIST outperforms the well-known methods. It also indicates that eyebrow information is helpful in ME spotting when eye-blinking artifacts are mitigated, and human gaze information can be consolidated with other encodings for performance improvement.
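The abstract above names a Gaussian filter layer and multiscale temporal convolutions as ingredients of PERSIST. The Python (PyTorch) sketch below shows generic versions of those two building blocks (a fixed depthwise Gaussian smoothing layer and a dilated residual temporal convolution block); channel sizes and kernel choices are illustrative, not the PERSIST configuration.

# Generic sketches of a fixed Gaussian smoothing layer and a dilated 1-D
# (temporal) convolution block, assuming PyTorch. Illustration only, not the
# PERSIST architecture itself.
import torch
import torch.nn as nn

class GaussianSmoothing1d(nn.Module):
    """Depthwise 1-D convolution with a fixed Gaussian kernel."""
    def __init__(self, channels, kernel_size=5, sigma=1.0):
        super().__init__()
        half = kernel_size // 2
        x = torch.arange(-half, half + 1, dtype=torch.float32)
        kernel = torch.exp(-x ** 2 / (2 * sigma ** 2))
        kernel = kernel / kernel.sum()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=half, groups=channels, bias=False)
        self.conv.weight.data.copy_(kernel.repeat(channels, 1, 1))
        self.conv.weight.requires_grad_(False)    # fixed, non-learnable filter

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(x)

class DilatedTemporalBlock(nn.Module):
    """Residual block of dilated temporal convolutions (TCN-style)."""
    def __init__(self, channels, dilation):
        super().__init__()
        pad = dilation  # keeps the temporal length for kernel_size=3
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)                     # residual connection

# Usage: smooth the frame features, then model long-range temporal context.
features = torch.randn(2, 64, 200)                 # (batch, channels, frames)
smoothed = GaussianSmoothing1d(64)(features)
out = DilatedTemporalBlock(64, dilation=4)(smoothed)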
... For instance, the geographical location, time and environment of capturing or viewing the video can be considered as contextual information which can assist in the summarization process. The articles [8,16,37,45,53,57,60,61,71,85,102,123] are based on the criteria of exploiting contextual information from a video stream for the summarization process. ...
... The perception-based summaries are based on how the user perceives the content of a video. They are generally based on visual attention, the level of excitement while viewing the video, emotions, the level of importance that the user associates with the content, facial expressions [45,86], etc. Such summaries focus on inferring the perception of the user rather than the objects, events or semantics of a video. ...
The volume of video data generated has seen exponential growth over the years, and video summarization has emerged as a process that can facilitate efficient storage, quick browsing, indexing, fast retrieval and quick sharing of the content. In view of the vast literature available on different aspects of video summarization approaches and techniques, a need has arisen to summarize and organize recent research findings, future research focus and trends, challenges, performance measures and evaluation, and datasets for testing and validation. This paper investigates the existing video summarization frameworks and presents a comprehensive view of the existing approaches and techniques. It highlights the recent advances in the techniques and discusses the paradigm shift that has occurred over the last two decades in the area, leading to considerable improvement. Attempts are made to consolidate the most significant findings, from the basic summarization structure to the classification of summarization techniques and noteworthy contributions in the area. Additionally, the existing datasets, categorized domain-wise for the purpose of video summarization and evaluation, are enumerated. The present study would be helpful in assimilating important research findings and data for ready reference, identifying groundwork and exploring potential directions for further research.
... Javier et al. proposed a system that displays facial expressions on a map in real time via identification of facial expressions through an installed camera [18]. Joho et al. proposed a method for determining the affective highlights of videos on the basis of the recorded facial expressions [19]. Ahuja et al. proposed a system that extends the range of expression in VR/AR using a device that combines a VR/AR headset, hemispherical mirrors, and a smartphone camera [20]. ...
... In the field of video summarization, external webcams are utilized to detect user's facial expressions [2,10,11], eye blinks, eye movements, and head motion [2]. These works require the use of fixed cameras and suffer from immobility, significantly restricting the range of potential applications. ...
... The requirement of using obtrusive and non-wearable sensors to extract a user's emotions has encouraged most research to focus on direct affective content utilizing video frames [12][13][14][15], video segments [16][17][18][19], audio and text features [14]. The present results remained poor on average due to the semantic gap problem [10] of detected objects or events. ...
... Although most of the works such as [41] use external affective content for focused applications such as emotion tagging in video, Joho et al. [10,11] utilized the viewer's facial expressions for video summarization. Despite the low performance of their model, they are able to use an unobtrusive sensor to retrieve affective content. ...
In this paper, we present HOMER, a cloud-based system for video highlight generation which enables the automated, relevant, and flexible segmentation of videos. Our system outperforms state-of-the-art solutions by fusing internal video content-based features with the user’s emotion data. While current research mainly focuses on creating video summaries without the use of affective data, our solution achieves the subjective task of detecting highlights by leveraging human emotions. In two separate experiments, including videos filmed with a dual camera setup, and home videos randomly picked from Microsoft’s Video Titles in the Wild (VTW) dataset, HOMER demonstrates an improvement of up to 38% in F1-score from baseline, while not requiring any external hardware. We demonstrated both the portability and scalability of HOMER through the implementation of two smartphone applications.
... The chosen frames are then refined to bypass any insignificant frames in the video summary. Joho et al. [18] followed a summarization process based on the viewer's facial expression. A video summarization based on pupillary dilation and eye gaze was proposed by Katti et al. [19]. ...
This paper proposes an efficient video summarization framework that will give a gist of the entire video in a few key-frames or video skims. Existing video summarization frameworks are based on algorithms that utilize computer vision low-level feature extraction or high-level domain-level extraction. However, despite being the ultimate users of the summarized video, humans remain the most neglected aspect. Therefore, the proposed paper considers the human's role in summarization and introduces human visual attention-based summarization techniques. To understand human attention behavior, we have designed and performed experiments with human participants using electroencephalogram (EEG) and eye-tracking technology. The EEG and eye-tracking data obtained from the experimentation are processed simultaneously and used to segment frames containing useful information from a considerable video volume. Thus, the frame segmentation primarily relies on the cognitive judgments of human beings. Using our approach, a video is summarized by 96.5% while maintaining higher precision and high recall factors. The comparison with the state-of-the-art techniques demonstrates that the proposed approach yields ceiling-level performance with reduced computational cost in summarising the videos.
... Emotion recognition is relevant in many computing areas that take into account the affective state of the user, such as human-computer interaction [1], human-robot interaction [2], music and image recommendation [3], affective video summarization [4], and personal wellness and assistive technologies [5]. Although emotion recognition is an interesting problem, it is also very challenging unless the recording conditions are well controlled. ...
This paper presents an audiovisual-based emotion recognition hybrid network. While most of the previous work focuses either on using deep models or hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neural networks (RNN) trained on facial images, the hybrid network also contains one SVM classifier trained on holistic acoustic feature vectors, one long short-term memory network (LSTM) trained on short-term feature sequences extracted from segmented audio clips, and one Inception(v2)-LSTM network trained on image-like maps, which are built based on short-term acoustic feature sequences. Experimental results show that the proposed hybrid network outperforms the baseline method by a large margin.
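The abstract above combines several classifiers trained on different modalities. As a generic illustration of score-level (late) fusion, the Python sketch below averages per-class probabilities from multiple models with fixed weights; the class count, weights and example outputs are invented, and the paper's actual fusion scheme may differ.

# Minimal sketch of score-level (late) fusion of several emotion classifiers
# as a weighted average of per-class probabilities. Values are illustrative.
import numpy as np

NUM_CLASSES = 7  # e.g. the common basic-emotion categories

def late_fusion(probabilities, weights):
    """probabilities: list of (NUM_CLASSES,) arrays, one per model."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(probabilities)                 # (num_models, NUM_CLASSES)
    fused = (weights[:, None] * stacked).sum(axis=0)
    return fused, int(np.argmax(fused))

# Example with three hypothetical model outputs (e.g. CNN, audio LSTM, SVM).
cnn_p = np.array([0.10, 0.60, 0.05, 0.05, 0.10, 0.05, 0.05])
lstm_p = np.array([0.20, 0.40, 0.10, 0.10, 0.10, 0.05, 0.05])
svm_p = np.array([0.15, 0.50, 0.05, 0.10, 0.10, 0.05, 0.05])
fused, label = late_fusion([cnn_p, lstm_p, svm_p], weights=[0.4, 0.3, 0.3])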