Conference Paper

Thin slicing to predict viewer impressions of TED Talks

... [6] calculated the speed and acceleration of hand movement of speakers, and used their mean and peak values as features to predict user ratings for thin slices of TED videos. However, these features are difficult to interpret and to correlate with human-interpretable ideas of walking, arm movement, and head movement. ...
... We first preprocess the data by clipping the videos to the last one minute. Previous studies [6] have shown that thin slices of video of up to one minute have high correlation with audience ratings. For the TED dataset we remove videos with the tags 'Performance' and 'Live Music', and those for which a person is detected for less than 10 seconds in the slice. ...
... [Figure: approach pipeline] ... hip and hand movement speeds, and the mean and standard deviation of head movement (Cullen and Harte, 2017). ...
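The excerpts above describe these motion features concretely enough to sketch. The snippet below is an illustrative sketch, not the cited authors' code: it clips a per-frame hand-keypoint track to the final one-minute slice and computes mean and peak hand speed and acceleration. The frame rate, the NaN handling for missed detections, and the function name are assumptions.

```python
import numpy as np

def hand_motion_features(hand_xy, fps=25.0, slice_seconds=60):
    """Mean/peak speed and acceleration of a hand-keypoint track.

    hand_xy: array of shape (n_frames, 2) with per-frame (x, y) positions
    of one hand keypoint; frames with no detection may be NaN.
    """
    # Keep only the last `slice_seconds` of the talk (the thin slice).
    hand_xy = np.asarray(hand_xy, dtype=float)[-int(slice_seconds * fps):]

    # Drop frames without a detection.
    valid = ~np.isnan(hand_xy).any(axis=1)
    hand_xy = hand_xy[valid]
    if len(hand_xy) < 3:
        return None  # person detected for too few frames; skip this video

    # First and second temporal derivatives: speed and acceleration.
    velocity = np.diff(hand_xy, axis=0) * fps    # pixels per second
    speed = np.linalg.norm(velocity, axis=1)
    accel = np.abs(np.diff(speed)) * fps         # pixels per second^2

    return {
        "speed_mean": speed.mean(), "speed_peak": speed.max(),
        "accel_mean": accel.mean(), "accel_peak": accel.max(),
    }
```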
Conference Paper
Full-text available
A speaker's bodily cues such as walking, arm movement and head movement play a big role in establishing engagement with the audience. Low-level pose keypoint features to capture these cues have been used in prior studies to characterise engagement and ratings, but are typically not interpretable, and have not been subjected to analysis to understand their meaning. We thus apply a completely unsupervised approach on these low-level features to obtain easily computable higher-level features that represent low, medium, and high cue usage by the speaker. We apply our approach to classroom recorded lectures and the significantly more difficult dataset of TED videos, and are able to positively correlate our features to human-interpretable ideas of a speaker's lateral head motion and movement. We hope that the interpretable nature of these features can be used in future work to serve as a means of feedback to speakers, and to better understand the underlying structure behind the results.
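A minimal sketch of the unsupervised step this abstract outlines, under the assumption that three clusters over per-video movement statistics can be read off as low, medium, and high cue usage; the feature layout and the use of k-means are illustrative, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cue_usage_levels(movement_stats):
    """movement_stats: (n_videos, n_features) array of low-level motion
    statistics per video (e.g. head-motion variance, mean hand speed)."""
    X = StandardScaler().fit_transform(movement_stats)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Order clusters by overall movement magnitude so that labels 0/1/2
    # can be interpreted as low/medium/high cue usage.
    order = np.argsort(km.cluster_centers_.mean(axis=1))
    relabel = {old: new for new, old in enumerate(order)}
    return np.array([relabel[c] for c in km.labels_])
```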
... They show that talks with different ratings have different trajectories to some extent and suggest that varying the emotions during the talk, building a great ending, and initiating a snowball effect are key elements for a successful talk. In a similar study, Cullen and Harte (2017) investigate pieces of the talk to predict inspiring, funny and persuasive user ratings. They find that longer slices and slices towards the end of the talk have greater power to predict the rating of the talk. ...
Article
This paper aims to study how people utilize (search for, choose, process, and evaluate) information provided on online domains, emphasizing the balance between context identifiers and the actual content of information and the psychological processes. The study assesses the popularity of online provided materials, TED Talks, in relation to the length of information, user ratings, and several content-related features. The paper employs a comprehensive naturalistic data set that covers the titles, duration, viewer-assigned ratings/tags, transcripts, various content identifiers, and popularity (number of views) of 2685 TED Talks. The results reveal the relevance of both content and context-related factors, as well as psychological processes, on the popularity of the talks. On the context side, using certain words in the title and the text, optimizing the talk pace and the length of the talk; on the content side, carefully incorporating rhetorical features are major factors that influence the popularity of the talks. On the psychological processes front, the popularity of talks is associated with positive emotions and anxiety among affective processes, and insight and tentativeness among cognitive processes.
... The effect of the location of these slices has been investigated as well; the conventional approach is to situate the slices near the start of the interaction. The effectiveness of thin slices has been observed in many applications, ranging from judging personality traits [14,25] such as the "big five" [26] to viewer impressions of TED talks [27] such as "funny" and "inspiring". ...
Preprint
Full-text available
Automatic quantification of human interaction behaviors based on language information has been shown to be effective in psychotherapy research domains such as marital therapy and cancer care. Existing systems typically use a moving-window approach where the target behavior construct is first quantified based on observations inside a window, such as a fixed number of words or turns, and then integrated over all the windows in that interaction. Given a behavior of interest, it is important to employ the appropriate length of observation, since too short a window might not contain sufficient information. Unfortunately, the link between behavior and observation length for lexical cues has not been well studied and it is not clear how these requirements relate to the characteristics of the target behavior construct. Therefore, in this paper, we investigate how the choice of window length affects the efficacy of language-based behavior quantification, by analyzing (a) the similarity between system predictions and human expert assessments for the same behavior construct and (b) the consistency in relations between predictions of related behavior constructs. We apply our analysis to a large and diverse set of behavior codes that are used to annotate real-life interactions and find that behaviors related to negative affect can be quantified from just a few words whereas those related to positive traits and problem solving require much longer observation windows. On the other hand, constructs that describe dysphoric affect do not appear to be quantifiable from language information alone, regardless of how long they are observed. We compare our findings with related work on behavior quantification based on acoustic vocal cues as well as with prior work on thin slices and human personality predictions and find that, in general, they are in agreement.
Article
The task of quantifying human behavior by observing interaction cues is an important and useful one across a range of domains in psychological research and practice. Machine learning-based approaches typically perform this task by first estimating behavior based on cues within an observation window, such as a fixed number of words, and then aggregating the behavior over all the windows in that interaction. The length of this window directly impacts the accuracy of estimation by controlling the amount of information being used. The exact link between window length and accuracy, however, has not been well studied, especially in spoken language. In this paper, we investigate this link and present an analysis framework that determines appropriate window lengths for the task of behavior estimation. Our proposed framework utilizes a two-pronged evaluation approach: (a) extrinsic similarity between machine predictions and human expert annotations, and (b) intrinsic consistency between intra-machine and intra-human behavior relations. We apply our analysis to real-life conversations that are annotated for a large and diverse set of behavior codes and examine the relation between the nature of a behavior and how long it should be observed. We find that behaviors describing negative and positive affect can be accurately estimated from short to medium-length expressions whereas behaviors related to problem-solving and dysphoria require much longer observations and are difficult to quantify from language alone. These findings are found to be generally consistent across different behavior modeling approaches.
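Both window-length studies above rest on the same moving-window scheme: score each fixed-size window of words with a behavior model, then aggregate the scores over the session. A hedged sketch, with `score_window` standing in for a trained behavior model:

```python
def estimate_behavior(words, window_size, score_window, hop=None):
    """Slide a fixed-size word window over a transcript and average
    per-window behavior scores into one session-level estimate.

    `score_window` is any callable mapping a list of words to a float
    (a trained behavior model in the papers above; a stub here).
    """
    hop = hop or window_size
    scores = [
        score_window(words[start:start + window_size])
        for start in range(0, max(len(words) - window_size + 1, 1), hop)
    ]
    return sum(scores) / len(scores)

# Toy usage: longer windows give the scorer more context per estimate.
toy_scorer = lambda ws: sum(w in {"great", "fine", "good"} for w in ws) / len(ws)
transcript = "it is fine we can work this out and that is good".split()
print(estimate_behavior(transcript, window_size=5, score_window=toy_scorer))
```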
Conference Paper
Full-text available
The ability to speak proficiently in public is essential for many professions and in everyday life. Public speaking skills are difficult to master and require extensive training. Recent developments in technology enable new approaches for public speaking training that allow users to practice in engaging and interactive environments. Here, we focus on the automatic assessment of nonverbal behavior and multimodal modeling of public speaking behavior. We automatically identify audiovisual nonverbal behaviors that are correlated to expert judges' opinions of key performance aspects. These automatic assessments enable a virtual audience to provide feedback that is essential for training during a public speaking performance. We utilize multimodal ensemble tree learners to automatically approximate expert judges' evaluations to provide post-hoc performance assessments to the speakers. Our automatic performance evaluation is highly correlated with the experts' opinions with r = 0.745 for the overall performance assessments. We compare multimodal approaches with single modalities and find that the multimodal ensembles consistently outperform single modalities.
Conference Paper
Full-text available
Video content is being produced in ever increasing quantities and offers a potentially highly diverse source for personalizable content. A key characteristic of quality video content is the engaging experience it offers for end users. This paper explores how different characteristics of a video, e.g. face detection and paralinguistic features in the audio track, extracted from the video's different modalities, can impact how users rate and thereby engage with the video. These characteristics can further be used to help segment videos in a personalized and contextually aware manner. Initial experimental results from the study presented in this paper are encouraging.
Conference Paper
Full-text available
This paper provides an overview of previous works on the acoustic parameters of charismatic voice and illustrates MASCharP, a scale for measuring the perception of charisma in voice. A study is then presented on the perception of charisma through the temporal and pitch structure of the voices of an Italian and a French politician. Results show some cultural differences in charisma perception and how acoustic features such as pitch (normal, higher, or lower) and types of pauses (short or long) can affect the Proactive-Attractive and Calm-Benevolent dimensions of charisma. The same dimension of charisma can be conveyed by different acoustic correlates of voice by connecting them to the dimension of leader extraversion-introversion.
Article
Full-text available
Body movements communicate affective expressions and, in recent years, computational models have been developed to recognize affective expressions from body movements or to generate movements for virtual agents or robots which convey affective expressions. This survey summarizes the state of the art on automatic recognition and generation of such movements. For both automatic recognition and generation, important aspects such as the movements analyzed, the affective state representation used, and the use of notation systems is discussed. The survey concludes with an outline of open problems and directions for future work.
Conference Paper
Full-text available
This paper introduces a new dataset and compares several methods for the recommendation of non-fiction audiovisual material, namely lectures from the TED website. The TED dataset contains 1,149 talks and 69,023 profiles of users, who have made more than 100,000 ratings and 200,000 comments. This data set, which we make public, can be used for training and testing of generic and personalized recommendation tasks. We define content-based, collaborative, and combined recommendation methods for TED lectures and use cross-validation to select the best parameters of keyword-based (TFIDF) and semantic vector space-based methods (LSI, LDA, RP, and ESA). We compare these methods on a personalized recommendation task in two settings, a cold-start and a non-cold-start one. In the former, semantic-based vector spaces perform better than keyword-based ones. In the latter, where collaborative information can be exploited, content-based methods are outperformed by collaborative filtering ones, but the proposed combined method shows acceptable performances, and can be used in both settings.
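For the content-based arm of this comparison, a minimal keyword-based (TF-IDF) sketch: rank other talks by cosine similarity to a talk the user liked. The toy talk texts and the `recommend` helper are assumptions for illustration; the paper additionally evaluates LSI, LDA, RP, ESA and collaborative filtering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for TED talk transcripts or descriptions.
talks = {
    "talk_a": "the power of vulnerability and human connection",
    "talk_b": "how schools kill creativity in children",
    "talk_c": "the surprising science of motivation at work",
}

ids, texts = zip(*talks.items())
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

def recommend(liked_id, top_k=2):
    """Rank the other talks by cosine similarity to a talk the user liked."""
    i = ids.index(liked_id)
    sims = cosine_similarity(tfidf[i], tfidf).ravel()
    ranked = sorted((s, t) for t, s in zip(ids, sims) if t != liked_id)
    return [t for _, t in reversed(ranked)][:top_k]

print(recommend("talk_a"))
```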
Article
Full-text available
We introduce the automatic determination of leadership emergence by acoustic and linguistic features in on-line speeches. Full realism is provided by the varying and challenging acoustic conditions of the presented YouTube corpus of on-line available speeches labeled by ten raters, and by processing that includes Long Short-Term Memory based robust voice activity detection and automatic speech recognition prior to feature extraction. We discuss cluster-preserving scaling of ten original dimensions for discrete and continuous task modeling, ground truth establishment and appropriate feature extraction for this novel speaker trait analysis paradigm. In extensive classification and regression runs, different temporal chunkings, feature relevance and optimal late fusion strategies of feature streams are presented. In the result, achievers, charismatic speakers and teamplayers can be recognized significantly above chance level, reaching up to 72.5% accuracy on unseen test data.
Conference Paper
Full-text available
One of the main challenges in emotion recognition from speech is to discriminate emotions in the valence domain (positive versus negative). While acoustic features provide good characterization in the activation/arousal dimension (excited versus calm), they usually fail to discriminate between sentences with different valence attributes (e.g., happy versus anger). This paper focuses on this dimension, which is key in many behavioral problems (e.g., depression). First, a regression analysis is conducted to identify the most informative features. Separate support vector regression (SVR) models are trained with various feature groups. The results reveal that spectral and F0 features produce the most accurate predictions of valence. Then, sentences with similar activation but different valence are carefully studied. The discriminative power in the valence domain of individual features is studied with logistic regression analysis. This controlled experiment reveals differences between positive and negative emotions in the F0 distribution (e.g., positive skewness). The study also uncovers characteristic trends in the spectral domain.
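A hedged sketch of the regression setup this abstract describes, with synthetic data standing in for acoustic feature groups and annotated valence; in the paper, separate SVR models are trained per feature group and compared.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Placeholder data: rows are utterances, columns are acoustic features
# (e.g. F0 statistics, spectral descriptors); y is the annotated valence.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = X[:, 0] * 0.6 + rng.normal(scale=0.5, size=200)  # synthetic valence

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```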
Article
Full-text available
In this paper a study on multimodal automatic emotion recognition during a speech-based interaction is presented. A database was constructed consisting of people pronouncing a sentence in a scenario where they interacted with an agent using speech. Ten people pronounced a sentence corresponding to a command while making 8 different emotional expressions. Gender was equally represented, with speakers of several different native languages including French, German, Greek and Italian. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. For the automatic classification of unimodal data, bimodal data and multimodal data, a system based on a Bayesian classifier was used. After performing an automatic classification of each modality, the different modalities were combined using a multimodal approach. Fusion of the modalities at the feature level (before running the classifier) and at the results level (combining results from the classifier of each modality) were compared. Fusing the multimodal data resulted in a large increase in the recognition rates in comparison to the unimodal systems: the multimodal approach increased the recognition rate by more than 10% when compared to the most successful unimodal system. Bimodal emotion recognition based on all combinations of the modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’) was also investigated. The results show that the best pairing is ‘gesture-speech’. Using all three modalities resulted in a 3.3% classification improvement over the best bimodal results. Keywords: Affective body language, Affective speech, Facial expression, Emotion recognition, Multimodal fusion.
Article
Full-text available
Recent research has shown that rapid judgments about the personality traits of political candidates, based solely on their appearance, can predict their electoral success. This suggests that voters rely heavily on appearances when choosing which candidate to elect. Here we review this literature and examine the determinants of the relationship between appearance-based trait inferences and voting. We also reanalyze previous data to show that facial competence is a highly robust and specific predictor of political preferences. Finally, we introduce a computer model of face-based competence judgments, which we use to derive some of the facial features associated with these judgments. Keywords: First impressions, Voting, Political decision making, Face perception, Social cognition.
Article
Full-text available
The ability to recognize humans and their activities by vision is key for a machine to interact intelligently and effortlessly with a human-inhabited environment. Because of many potentially important applications, “looking at people” is currently one of the most active application domains in computer vision. This survey identifies a number of promising applications and provides an overview of recent developments in this domain. The scope of this survey is limited to work on whole-body or hand motion; it does not include work on human faces. The emphasis is on discussing the various methodologies; they are grouped in 2-D approaches with or without explicit shape models and 3-D approaches. Where appropriate, systems are reviewed. We conclude with some thoughts about future directions.
Conference Paper
Full-text available
This paper deals with subjective qualities and acoustic-prosodic features contributing to the impression of a good speaker. Subjects rated a variety of samples of political speech on a number of subjective qualities, and acoustic features were extracted from the speech samples. A perceptual evaluation was also conducted with manipulations of F0 dynamics, fluency and speech rate, using the sample of the lowest rated speaker as a basis. Subjects' rankings revealed a clear preference for the modified versions over the original, with F0 dynamics (a wider range) being the most powerful cue. Index Terms: prosody, speaker skill, subject ratings, acoustic measurements, synthesis evaluation
Conference Paper
Full-text available
Detecting levels of interest from speakers is a new problem in Spoken Dialog Understanding with significant impact on real-world business applications. Previous work has focused on the analysis of traditional acoustic signals and shallow lexical features. In this paper, we present a novel hierarchical fusion learning model that takes feedback from previous multistream predictions of prominent seed samples into account and uses a mean cosine similarity measure to learn rules that improve reclassification. Our method is domain-independent and can be adapted to other speech and language processing areas where domain adaptation is expensive to perform. Incorporating Discriminative Term Frequency and Inverse Document Frequency (D-TFIDF), lexical affect scoring, and low and high level prosodic and acoustic features, our experiments outperform the published results of all systems participating in the 2010 Interspeech Paralinguistic Affect Subchallenge.
Article
Full-text available
This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the Integral Image which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
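The integral image from the first contribution is easy to state in code: each entry stores the sum of all pixels above and to the left, so any rectangle sum (and hence any Haar-like feature) costs four lookups. A small NumPy sketch, not the original implementation:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded with a leading row/column of
    zeros so rectangle sums need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from four lookups in the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```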
Article
Full-text available
This paper presents a new classification algorithm for real-time inference of affect from nonverbal features of speech and applies it to assessing public speaking skills. The classifier identifies simultaneously occurring affective states by recognizing correlations between emotions and over 6,000 functional-feature combinations. Pairwise classifiers are constructed for nine classes from the Mind Reading emotion corpus, yielding an average cross-validation accuracy of 89 percent for the pairwise machines and 86 percent for the fused machine. The paper also shows a novel application of the classifier for assessing public speaking skills, achieving an average cross-validation accuracy of 81 percent and a leave-one-speaker-out classification accuracy of 61 percent. Optimizing support vector machine coefficients using grid parameter search is shown to improve the accuracy by up to 25 percent. The emotion classifier outperforms previous research on the same emotion corpus and is successfully applied to analyze public speaking skills.
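The grid parameter search credited with up to a 25 percent accuracy gain is simple to reproduce in outline; the parameter grid and synthetic data below are placeholders rather than the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder stand-in for per-utterance functional-feature vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```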
Article
Full-text available
Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions despite the fact that deliberate behaviour differs in visual appearance, audio profile, and timing from spontaneously occurring behaviour. To address this problem, efforts to develop algorithms that can process naturally occurring human affective behaviour have recently emerged. Moreover, an increasing number of efforts are reported toward multimodal fusion for human affect analysis including audiovisual fusion, linguistic and paralinguistic fusion, and multi-cue visual fusion based on facial expressions, head movements, and body gestures. This paper introduces and surveys these recent advances. We first discuss human emotion perception from a psychological perspective. Next we examine available approaches to solving the problem of machine understanding of human affective behavior, and discuss important issues like the collection and availability of training and test data. We finally outline some of the scientific and engineering challenges to advancing human affect sensing technology.
Conference Paper
Full-text available
This paper describes the use of statistical techniques and hidden Markov models (HMM) in the recognition of emotions. The method aims to classify 6 basic emotions (anger, dislike, fear, happiness, sadness and surprise) from both facial expressions (video) and emotional speech (audio). The emotions of 2 human subjects were recorded and analyzed. The findings show that the audio and video information can be combined using a rule-based system to improve the recognition rate
Article
We hypothesize that certain speaker gestures can convey significant information that is correlated to audience engagement. We propose gesture attributes, derived from speakers' tracked hand motions, to automatically quantify these gestures from video. Then, we demonstrate a correlation between gesture attributes and an objective method of measuring audience engagement: electroencephalography (EEG) in the domain of political debates. We collect 47 minutes of EEG recordings from each of 20 subjects watching clips of the 2012 U.S. Presidential debates. The subjects are examined in aggregate and in subgroups according to gender and political affiliation. We find statistically significant correlations between gesture attributes (particularly extremal pose) and our feature of engagement derived from EEG, both with and without audio. For some stratifications, the Spearman rank correlation reaches as high as ρ = 0.283 with p < 0.05, Bonferroni corrected. From these results, we identify those gestures that can be used to measure engagement, principally those that break habitual gestural patterns.
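The reported statistic is a Spearman rank correlation between each gesture attribute and the EEG-derived engagement feature, tested against a Bonferroni-corrected threshold; a sketch with synthetic data (the attribute names are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
engagement = rng.normal(size=120)                 # EEG-derived feature per clip
gesture_attrs = {                                 # placeholder gesture attributes
    "extremal_pose": engagement * 0.3 + rng.normal(size=120),
    "gesture_rate": rng.normal(size=120),
}

alpha = 0.05 / len(gesture_attrs)                 # Bonferroni correction
for name, values in gesture_attrs.items():
    rho, p = spearmanr(values, engagement)
    print(f"{name}: rho={rho:.3f}, significant={p < alpha}")
```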
Conference Paper
In this paper, we present improvements made to the TED-LIUM corpus we released in 2012. These enhancements fall into two categories. First, we describe how we filtered publicly available monolingual data and used it to estimate well-suited language models (LMs), using open-source tools. Then, we describe the process of selection we applied to new acoustic data from TED talks, providing additions to our previously released corpus. Finally, we report some experiments we made around these improvements.
Conference Paper
Recent studies have shown the importance of using online videos along with textual material in educational instruction, especially for better content retention and improved concept understanding. A key question is how to select videos to maximize student engagement, particularly when there are multiple possible videos on the same topic. While there are many aspects that drive student engagement, in this paper we focus on presenter speaking styles in the video. We use crowd-sourcing to explore speaking style dimensions in online educational videos, and identify six broad dimensions: liveliness, speaking rate, pleasantness, clarity, formality and confidence. We then propose techniques based solely on acoustic features for automatically identifying a subset of the dimensions. Finally, we perform video re-ranking experiments to learn how users apply their speaking style preferences to augment textbook material. Our findings also indicate how certain dimensions are correlated with perceptions of general pleasantness of the voice.
Conference Paper
The problem of automatically estimating the interest level of a subject has been gaining attention from researchers, mostly due to the vast applicability of interest detection. In this work, we obtain a set of continuous interest annotations for the SEMAINE database, which we also analyse in terms of emotion dimensions such as valence and arousal. Most importantly, we propose a robust variant of Canonical Correlation Analysis (RCCA) for performing audio-visual fusion, which we apply to the prediction of interest. RCCA recovers a low-rank subspace which captures the correlations of the fused modalities, while isolating gross errors in the data without making any assumptions regarding Gaussianity. We experimentally show that RCCA is more appropriate than other standard fusion techniques (such as l2-CCA and feature-level fusion), since it both captures interactions between modalities and decontaminates the obtained subspace from errors which are dominant in real-world problems.
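As a rough stand-in for the RCCA fusion described above, plain CCA from scikit-learn projects the audio and visual feature blocks into a shared correlated subspace whose canonical variates can feed a downstream interest predictor; the robust, error-isolating part of RCCA is not reproduced here, and the data is synthetic.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
shared = rng.normal(size=(300, 3))                        # latent interest-related signal
audio = np.hstack([shared, rng.normal(size=(300, 7))])    # audio feature block
video = np.hstack([shared, rng.normal(size=(300, 9))])    # visual feature block

cca = CCA(n_components=3).fit(audio, video)
audio_c, video_c = cca.transform(audio, video)

# The fused representation (e.g. concatenated canonical variates) would be
# the input to a downstream interest regressor.
fused = np.hstack([audio_c, video_c])
print(fused.shape, np.corrcoef(audio_c[:, 0], video_c[:, 0])[0, 1])
```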
Article
Despite an increasing interest in understanding human perception in social media through the automatic analysis of users' personality, existing attempts have explored user profiles and text blog data only. We approach the study of personality impressions in social media from the novel perspective of crowdsourced impressions, social attention, and audiovisual behavioral analysis on slices of conversational vlogs extracted from YouTube. Conversational vlogs are a unique case study to understand users in social media, as vloggers implicitly or explicitly share information about themselves that words, either written or spoken, cannot convey. In addition, research in vlogs may become a fertile ground for the study of video interactions, as conversational video expands to innovative applications. In this work, we first investigate the feasibility of crowdsourcing personality impressions from vlogging as a way to obtain judgements from a varied audience that consumes social media video. Then, we explore how these personality impressions mediate the online video watching experience and relate to measures of attention in YouTube. Finally, we investigate the use of automatic nonverbal cues as a suitable lens through which impressions are made, and we address the task of automatic prediction of vloggers' personality impressions using nonverbal cues and machine learning techniques. Our study, conducted on a dataset of 442 YouTube vlogs and 2210 annotations collected in Amazon's Mechanical Turk, provides new findings regarding the suitability of collecting personality impressions from crowdsourcing, the types of personality impressions that emerge through vlogging, their association with social attention, and the level of utilization of nonverbal cues in this particular setting. In addition, it constitutes a first attempt to address the task of automatic vlogger personality impression prediction using nonverbal cues, with promising results.
Article
The accuracy of first impressions was examined by investigating judged construct (negative affect, positive affect, the Big five personality variables, intelligence), exposure time (5, 20, 45, 60, and 300 s), and slice location (beginning, middle, end). Three hundred and thirty four judges rated 30 targets. Accuracy was defined as the correlation between a judge’s ratings and the target’s criterion scores on the same construct. Negative affect, extraversion, conscientiousness, and intelligence were judged moderately well after 5-s exposures; however, positive affect, neuroticism, openness, and agreeableness required more exposure time to achieve similar levels of accuracy. Overall, accuracy increased with exposure time, judgments based on later segments of the 5-min interactions were more accurate, and 60 s yielded the optimal ratio between accuracy and slice length. Results suggest that accuracy of first impressions depends on the type of judgment made, amount of exposure, and temporal location of the slice of judged social behavior.
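A worked example of the accuracy definition used in this study: for a given exposure time and slice location, accuracy is the correlation between one judge's ratings of the targets and the targets' criterion scores on the same construct. The numbers below are invented.

```python
import numpy as np

def thin_slice_accuracy(judge_ratings, criterion_scores):
    """Accuracy = Pearson correlation between one judge's ratings of all
    targets and the targets' criterion scores on the same construct."""
    return np.corrcoef(judge_ratings, criterion_scores)[0, 1]

# Invented example: one judge rating 6 targets on extraversion after a
# 5-second slice, versus the targets' criterion extraversion scores.
judge = np.array([3.0, 4.5, 2.0, 5.0, 3.5, 1.5])
criterion = np.array([2.8, 4.9, 2.4, 4.6, 3.1, 2.0])
print(round(thin_slice_accuracy(judge, criterion), 2))
```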
Article
Automatic detection of the level of human interest is of high relevance for many technical applications, such as automatic customer care or tutoring systems. However, the recognition of spontaneous interest in natural conversations independently of the subject remains a challenge. Identification of human affective states relying on single modalities only is often impossible, even for humans, since different modalities contain partially disjunctive cues. Multimodal approaches to human affect recognition generally are shown to boost recognition performance, yet are evaluated in restrictive laboratory settings only. Herein we introduce a fully automatic processing combination of Active–Appearance–Model-based facial expression, vision-based eye-activity estimation, acoustic features, linguistic analysis, non-linguistic vocalisations, and temporal context information in an early feature fusion process. We provide detailed subject-independent results for classification and regression of the Level of Interest using Support-Vector Machines on an audiovisual interest corpus (AVIC) consisting of spontaneous, conversational speech demonstrating “theoretical” effectiveness of the approach. Further, to evaluate the approach with regards to real-life usability a user-study is conducted for proof of “practical” effectiveness.
Conference Paper
In political speeches, the audience tends to react or resonate to signals of persuasive communication, including an expected theme, a name or an expression. Automatically predicting the impact of such discourses is a challenging task. In fact, nowadays, with the huge amount of textual material that flows on the Web (news, discourses, blogs, etc.), it can be useful to have a measure for testing the persuasiveness of what we retrieve or possibly of what we want to publish on the Web. In this paper we exploit a corpus of political discourses collected from various Web sources, tagged with audience reactions, such as applause, as indicators of persuasive expressions. In particular, we use this data set in a machine learning framework to explore the possibility of classifying the transcripts of political discourses according to their persuasive power, predicting the sentences that possibly trigger applause. We also explore differences between Democratic and Republican speeches, and experiment with the resulting classifiers by grading some of the discourses from the Obama-McCain presidential campaign available on the Web.
Article
Charisma, the ability to attract and retain followers without benefit of formal authority, is more difficult to define than to identify. While we each seem able to identify charismatic individuals – and non-charismatic individuals – it is not clear what it is about an individual that influences our judgment. This paper describes the results of experiments designed to discover potential correlates of such judgments, in what speakers say and the way that they say it. We present results of two parallel experiments in which subjective judgments of charisma in spoken and in transcribed American political speech were analyzed with respect to the acoustic and prosodic (where applicable) and lexico-syntactic characteristics of the speech being assessed. While we find that there is considerable disagreement among subjects on how the speakers of each token are ranked, we also find that subjects appear to share a functional definition of charisma, in terms of other personal characteristics we asked them to rank speakers by. We also find certain acoustic, prosodic, and lexico-syntactic characteristics that correlate significantly with perceptions of charisma. Finally, by comparing the responses to spoken vs. transcribed stimuli, we attempt to distinguish between the contributions of “what is said” and “how it is said” with respect to charisma judgments.
Prosodic aspects of political rhetoric
  • P Touati
P. Touati, "Prosodic aspects of political rhetoric," in ESCA Workshop on Prosody, 1993, pp. 168–171.