Chapter

Automated Nonverbal Cue Detection in Political-Debate Videos: An Optimized RNN-LSTM Approach


Abstract

This study proposes a computational video-analysis pipeline that uses OpenPose for keypoint detection, an RNN-LSTM network to construct 12 gesture classifiers, and data augmentation and epoch early-stopping techniques for performance optimization. Measuring accuracy, precision, recall, and F1 scores, the study compares three approaches (a vanilla approach, a data-augmentation approach, and an epoch-optimization approach), which progressively improve model performance across all gesture features. The results suggest that combining data augmentation with epoch early stopping can effectively address the class-imbalance problem faced by customized datasets, raising accuracy and F1 scores by 10–20% and achieving a satisfactory accuracy of 70%–90% for most gesture detections.
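At a high level, the pipeline the abstract describes (OpenPose keypoints in, per-gesture RNN-LSTM classifiers out) can be sketched as below. The array shapes, window length, and gesture names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Assumed layout: OpenPose's BODY_25 model gives 25 keypoints per frame,
# each as (x, y, confidence). A debate clip becomes a sequence of frames.
N_KEYPOINTS = 25
WINDOW = 30  # frames per training window (assumption)

def frames_to_windows(frames, window=WINDOW):
    """Slice a (num_frames, 25, 3) keypoint array into fixed-length
    windows of shape (num_windows, window, 25 * 3) for an LSTM."""
    frames = np.asarray(frames, dtype=np.float32)
    num_windows = frames.shape[0] // window
    clipped = frames[: num_windows * window]  # drop the trailing remainder
    return clipped.reshape(num_windows, window, N_KEYPOINTS * 3)

# One binary classifier per gesture, as in the paper's 12-classifier design.
GESTURES = [f"gesture_{i}" for i in range(12)]  # placeholder names

frames = np.random.rand(95, N_KEYPOINTS, 3)  # ~3 s of video at ~30 fps
windows = frames_to_windows(frames)
print(windows.shape)  # (3, 30, 75)
```

Each window would then be fed to all 12 classifiers independently, so one video segment can carry several gesture labels at once.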

Article
Because of their effectiveness in broad practical applications, LSTM networks have received a wealth of coverage in scientific journals, technical blogs, and implementation guides. However, in most articles, the inference formulas for the LSTM network and its parent, RNN, are stated axiomatically, while the training formulas are omitted altogether. In addition, the technique of “unrolling” an RNN is routinely presented without justification throughout the literature. The goal of this tutorial is to explain the essential RNN and LSTM fundamentals in a single document. Drawing from concepts in Signal Processing, we formally derive the canonical RNN formulation from differential equations. We then propose and prove a precise statement, which yields the RNN unrolling technique. We also review the difficulties with training the standard RNN and address them by transforming the RNN into the “Vanilla LSTM”¹ network through a series of logical arguments. We provide all equations pertaining to the LSTM system together with detailed descriptions of its constituent entities. Albeit unconventional, our choice of notation and the method for presenting the LSTM system emphasizes ease of understanding. As part of the analysis, we identify new opportunities to enrich the LSTM system and incorporate these extensions into the Vanilla LSTM network, producing the most general LSTM variant to date. The target reader has already been exposed to RNNs and LSTM networks through numerous available resources and is open to an alternative pedagogical approach. A Machine Learning practitioner seeking guidance for implementing our new augmented LSTM model in software for experimentation and research will find the insights and derivations in this treatise valuable as well.
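The "Vanilla LSTM" the tutorial derives reduces to one compact forward step per time index. The numpy cell below follows the standard gate equations (forget, input, and output gates plus a candidate cell state); it is a minimal sketch of the canonical formulation, not the tutorial's augmented variant, and the weight shapes are conventional assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a vanilla LSTM cell.
    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,).
    Gate order in the stacked matrices: forget, input, output, candidate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])        # forget gate
    i = sigmoid(z[H:2*H])      # input gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c = f * c_prev + i * g     # new cell state
    h = o * np.tanh(c)         # new hidden state
    return h, c

# Tiny usage example: "unrolling" the cell over a 5-step sequence.
D, H = 3, 4
rng = np.random.default_rng(0)
h, c = np.zeros(H), np.zeros(H)
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

The loop is exactly the unrolling technique the tutorial justifies: the same parameters are reused at every time step while the state pair (h, c) carries information forward.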
Article
Abstract Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfitting. Overfitting refers to the phenomenon when a network learns a function with very high variance such as to perfectly model the training data. Unfortunately, many application domains do not have access to big data, such as medical image analysis. This survey focuses on Data Augmentation, a data-space solution to the problem of limited data. Data Augmentation encompasses a suite of techniques that enhance the size and quality of training datasets such that better Deep Learning models can be built using them. The image augmentation algorithms discussed in this survey include geometric transformations, color space augmentations, kernel filters, mixing images, random erasing, feature space augmentation, adversarial training, generative adversarial networks, neural style transfer, and meta-learning. The application of augmentation methods based on GANs are heavily covered in this survey. In addition to augmentation techniques, this paper will briefly discuss other characteristics of Data Augmentation such as test-time augmentation, resolution impact, final dataset size, and curriculum learning. This survey will present existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation. Readers will understand how Data Augmentation can improve the performance of their models and expand limited datasets to take advantage of the capabilities of big data.
Article
Image Bite Politics is the first book to systematically assess the visual presentation of presidential candidates in network news coverage of elections and to connect these visual images with shifts in public opinion. Presenting the results of a comprehensive visual analysis of general election news from 1992-2004, encompassing four presidential campaigns, the authors highlight the remarkably potent influence of television images when it comes to evaluating leaders. The book draws from a variety of disciplines, including political science, behavioral biology, cognitive neuroscience, and media studies in order to investigate the visual framing of elections in an interdisciplinary fashion. Moreover, the book presents findings that are counterintuitive and which challenge widely held assumptions; yet are supported by systematic data. For example, Republicans receive consistently more favorable visual treatment than Democrats, countering the conventional wisdom of a "liberal media bias"; and image bites are more prevalent, and in some elections more potent, in shaping voter opinions of candidates than sound bites. Finally, the authors provide a foundation for promoting visual literacy among news audiences and bring the importance of visual analysis to the forefront of research. © 2009 by Maria Elizabeth Grabe and Erik Page Bucy. All rights reserved.
Article
Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting ("early stopping"). The exact criterion used for validation-based early stopping, however, is usually chosen in an ad-hoc fashion or training is stopped interactively. This trick describes how to select a stopping criterion in a systematic fashion; it is a trick for either speeding learning procedures or improving generalization, whichever is more important in the particular situation. An empirical investigation on multi-layer perceptrons shows that there exists a tradeoff between training time and generalization: from the given mix of 1296 training runs using 12 different problems and 24 different network architectures I conclude slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).
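The patience-style criterion this trick systematizes is typically implemented like the sketch below; the patience value is a placeholder, and the list of losses stands in for a real training loop.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` epochs.
    `val_losses` stands in for a real training loop that yields one
    validation loss per epoch; returns (best_epoch, stopped_epoch)."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # checkpoint weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # restore best checkpoint and stop
    return best_epoch, len(val_losses) - 1

# Validation loss improves, then degrades: training halts 3 epochs
# after the minimum instead of running to convergence.
print(train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9]))  # (2, 5)
```

The tradeoff the article measures corresponds to the `patience` parameter: a larger value (a "slower" criterion) may find a slightly better minimum but trains for longer.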
Preprint
Voters evaluate politicians not just by what they say, but also how they say it, via facial displays of emotions and vocal pitch. Candidate characteristics can shape how leaders use – and how voters react to – nonverbal cues. Drawing on role congruity expectations, we focus on how gender shapes the use of and reactions to facial, voice, and textual communication in political debates. Using full-length debate videos from four German national elections (2005–2017) and a minor debate in 2017, we employ computer vision, machine learning, and text analysis to extract facial displays of emotion, vocal pitch, and speech sentiment. Consistent with our expectations, Angela Merkel expresses less anger and is less emotive than her male opponents. We combine second-by-second candidate emotions data with continuous responses recorded by live audiences. We find that voters punish Merkel for anger displays and reward her happiness and general emotional displays.
Article
Populism, as many have observed, is a communication phenomenon as much as a coherent ideology whose mass appeal stems from the fiery articulation of core positions, notably hostility toward “others,” bias against elites in favor of “the people,” and the transgressive delivery of those messages. Yet much of what we know about populist communication is based on analysis of candidate pronouncements, the verbal message conveyed at political events and over social media, rather than transgressive performances—the visual and tonal markers of outrage—that give populism its distinctive flair. The present study addresses this gap in the literature by using detailed verbal, tonal, and nonverbal coding of the first US presidential debate of 2016 between Donald Trump and Hillary Clinton to show how Trump’s transgressive style—his violation of normative boundaries, particularly those related to protocol and politeness, and open displays of frustration and anger—can be operationalized from a communication standpoint and used in statistical modeling to predict the volume of Twitter response to both candidates during the debate. Our findings support the view that Trump’s norm-violating transgressive style, a type of political performance, resonated with viewers significantly more than Clinton’s more controlled approach and garnered Trump substantial second-screen attention.
Article
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
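In practice, OpenPose writes per-frame JSON with a flat `pose_keypoints_2d` list of (x, y, confidence) triples for each detected person. The minimal parser below assumes the BODY_25 model (25 body keypoints) and ignores the hand, foot, and face channels; it is a sketch of downstream consumption, not part of OpenPose itself.

```python
import json
import numpy as np

def parse_openpose_frame(json_text, n_keypoints=25):
    """Parse one OpenPose output frame into a (people, n_keypoints, 3)
    array of (x, y, confidence). Frames with no people yield an empty array."""
    data = json.loads(json_text)
    people = data.get("people", [])
    if not people:
        return np.empty((0, n_keypoints, 3), dtype=np.float32)
    flat = [p["pose_keypoints_2d"] for p in people]
    return np.asarray(flat, dtype=np.float32).reshape(-1, n_keypoints, 3)

# One person with 25 zeroed keypoints, the shape OpenPose emits when a
# joint is undetected (confidence 0).
sample = json.dumps({"version": 1.3,
                     "people": [{"pose_keypoints_2d": [0.0] * 75}]})
print(parse_openpose_frame(sample).shape)  # (1, 25, 3)
```

Zero-confidence keypoints usually need masking or interpolation before being fed to a sequence model, since they encode "not detected" rather than a true position at the origin.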
Article
Affect detection is an important pattern recognition problem that has inspired researchers from several areas. The field is in need of a systematic review due to the recent influx of Multimodal (MM) affect detection systems that differ in several respects and sometimes yield incompatible results. This article provides such a survey via a quantitative review and meta-analysis of 90 peer-reviewed MM systems. The review indicated that the state of the art mainly consists of person-dependent models (62.2% of systems) that fuse audio and visual (55.6%) information to detect acted (52.2%) expressions of basic emotions and simple dimensions of arousal and valence (64.5%) with feature- (38.9%) and decision-level (35.6%) fusion techniques. However, there were also person-independent systems that considered additional modalities to detect nonbasic emotions and complex dimensions using model-level fusion techniques. The meta-analysis revealed that MM systems were consistently (85% of systems) more accurate than their best unimodal counterparts, with an average improvement of 9.83% (median of 6.60%). However, improvements were three times lower when systems were trained on natural (4.59%) versus acted data (12.7%). Importantly, MM accuracy could be accurately predicted (cross-validated R2 of 0.803) from unimodal accuracies and two system-level factors. Theoretical and applied implications and recommendations are discussed.
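Decision-level fusion, one of the two dominant techniques in the review (35.6% of systems), combines the outputs of independently trained per-modality classifiers rather than their features. A minimal weighted-average sketch, where the modality names and weights are illustrative:

```python
import numpy as np

def decision_level_fusion(modality_probs, weights=None):
    """Fuse per-modality class-probability vectors by weighted averaging.
    modality_probs: dict of modality name -> probability array; all
    arrays must share the same class ordering."""
    names = sorted(modality_probs)
    probs = np.stack([modality_probs[m] for m in names])
    if weights is None:
        weights = np.full(len(names), 1.0 / len(names))
    fused = np.average(probs, axis=0, weights=weights)
    return fused / fused.sum()  # renormalize to a distribution

# Audio and visual classifiers disagree on the class; fusion arbitrates.
fused = decision_level_fusion({
    "audio":  np.array([0.7, 0.2, 0.1]),   # e.g. anger / happiness / neutral
    "visual": np.array([0.3, 0.5, 0.2]),
})
print(fused)  # fused distribution: 0.5, 0.35, 0.15
```

Feature-level fusion, by contrast, would concatenate the raw audio and visual feature vectors and train a single classifier on the combined input.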
Conference Paper
Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. Promising approaches have been reported, including automatic methods for facial and vocal affect recognition. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions-despite the fact that deliberate behavior differs in visual and audio expressions from spontaneously occurring behavior. Recently efforts to develop algorithms that can process naturally occurring human affective behavior have emerged. This paper surveys these efforts. We first discuss human emotion perception from a psychological perspective. Next, we examine the available approaches to solving the problem of machine understanding of human affective behavior occurring in real-world settings. We finally outline some scientific and engineering challenges for advancing human affect sensing technology.
A thorough review on the current advance of neural network structures
  • S. Dupond
Computational communication science | Automated coding of televised leader displays: Detecting nonverbal political behavior with computer vision and deep learning
  • J. Joo
  • E. Bucy
  • C. Seidel