Conference Paper

Based on Isolated Saliency or Causal Integration? Toward a Better Understanding of Human Annotation Process using Multiple Instance Learning and Sequential Probability Ratio Test


Abstract

Human perception is capable of integrating local events to generate an overall impression at the global level; this ability is evident in daily life and is used repeatedly in behavioral science to bring objective measures into studies of human behavior. In this work, we explore two hypotheses about whether it is the isolated saliency or the causal integration of information that triggers global perceptual behavioral ratings as trained annotators engage in observational coding tasks. We carry out analyses using Multiple Instance Learning and the Sequential Probability Ratio Test on a corpus of real, spontaneous interactions of distressed couples with global session-level abstract behavioral coding done by trained human annotators. We present several analyses based on different behavioral detection schemes, demonstrating the potential of these algorithms to bring insight into the human annotation process. We further show that when annotating behaviors that leave a more positive impression, annotators gather information throughout the session, whereas for behaviors that leave a more negative impression a single salient instance is enough to trigger the final global decision.
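To make the two hypotheses concrete, the minimal sketch below contrasts an isolated-saliency rule (a single instance score exceeding a threshold, in the spirit of multiple instance learning) with a causal-integration rule (a sequential probability ratio test that accumulates evidence across talk-turns). It is an illustration only, not the authors' implementation; the function names, thresholds, and simulated per-turn scores are assumptions.

import numpy as np

def isolated_saliency_decision(scores, threshold):
    # MIL-style rule: flag the session as soon as any single instance score
    # exceeds the threshold -- one salient event is enough.
    return bool(np.max(scores) > threshold)

def causal_integration_decision(llrs, upper, lower):
    # SPRT-style rule: accumulate per-turn log-likelihood ratios over time and
    # decide only when the running sum crosses an upper or lower boundary.
    total = 0.0
    for t, llr in enumerate(llrs):
        total += llr
        if total >= upper:
            return True, t    # "behavior present" accepted at turn t
        if total <= lower:
            return False, t   # "behavior absent" accepted at turn t
    return total >= 0.0, len(llrs) - 1  # no boundary crossed: decide at session end

# Hypothetical per-talk-turn evidence for one session.
rng = np.random.default_rng(0)
turn_scores = rng.normal(loc=0.2, scale=1.0, size=50)
print(isolated_saliency_decision(turn_scores, threshold=2.5))
print(causal_integration_decision(turn_scores, upper=5.0, lower=-5.0))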

... Here are examples of works that have used those lexical features: bag-of-words (unigram [21,37], n-gram [5,20,70]), TF-IDF [39,52,60], LIWC [5], word embeddings (word2vec [94,95], ELMo [20]), deep sentence embeddings (seq-to-seq models [22,93,94,96], BERT and Sentence-BERT [5,11]). Transfer learning has also been used for the lexical data. ...
... The surveyed works used mostly supervised learning approaches with a few using semi-supervised [21,38,39,52,59,60,101] and unsupervised learning [62,93]. The algorithms used range from simple statistical algorithms and traditional machine learning to deep learning methods. ...
... Here are the algorithms used by various works: SVM [5,6,8,11,13,38,39,52,58,61,94,95,101,102], linear discriminant analysis (LDA) [6,101], Markov models [21,56,57,70,101], multiple instance learning (diversity density [39,59,60], diversity density SVM [38,52]), maximum likelihood [20,21,37,70], sequential probability ratio test [60], logistic regression [8], perceptron [101], Gaussian mixture model (GMM) [102], deep neural networks [22,61,62,96], LSTM [93,94,95,96], GRU [20,63], random forest [11,26], CNN [63]. ...
Preprint
Full-text available
Couples' relationships affect the physical health and emotional well-being of partners. Automatically recognizing each partner's emotions could give a better understanding of their individual emotional well-being, enable interventions, and provide clinical benefits. In this paper, we summarize and synthesize works that have focused on developing and evaluating systems to automatically recognize the emotions of each partner based on couples' interaction or conversation contexts. We identified 28 articles from IEEE, ACM, Web of Science, and Google Scholar that were published between 2010 and 2021. We detail the datasets, features, algorithms, evaluation, and results of each work, and present the main themes. We also discuss current challenges and research gaps, and propose future research directions. In summary, most works have used audio data collected in the lab with annotations done by external experts, and have used supervised machine learning approaches for binary classification of positive and negative affect. Performance results leave room for improvement, and significant research gaps remain, such as the lack of recognition using data from daily life. This survey will enable new researchers to get an overview of this field and eventually enable the development of emotion recognition systems to inform interventions to improve the emotional well-being of couples.
... A major challenge facing this research area is that there exists a heterogeneous set of manually collected and labeled datasets. Two important problems need to be addressed to widen the horizon of current research: (1) to develop training methods that can learn effectively from heterogeneous data sources and generalize well across different emotion corpora, and (2) to develop techniques that can leverage the large amount of unlabeled speech data in an active learning or semi-supervised framework. ...
... Sub-utterance data selection is motivated by the hypothesis that not all parts of an utterance are equally indicative of the target emotion, thus identifying and training only on salient local regions will facilitate generalization. Previous works have demonstrated the importance of local emotion dynamics in both human perception [2] and automatic classification [3]-[6]. ...
... As discussed in [2], it may be the case that only a short segment within an utterance actually influences human perception of a certain emotion. In such cases, considering all context windows in an utterance might introduce too much noise and mask the salient local pattern. ...
Conference Paper
Full-text available
Data selection is an important component of cross-corpus training and semi-supervised/active learning. However, its effect on acoustic emotion recognition is still not well understood. In this work, we perform an in-depth exploration of various data selection strategies for emotion classification from speech, using classifier agreement as the selection metric. Our methods span both the traditional utterance level and the less explored sub-utterance level. A median unweighted average recall of 70.68%, comparable to the winner of the 2009 INTERSPEECH Emotion Challenge, was achieved on the FAU Aibo 2-class problem using less than 50% of the training data. Our results indicate that sub-utterance selection leads to slightly faster convergence and significantly more stable learning. In addition, diversifying instances in terms of classifier agreement produces a faster learning rate, whereas selecting those near the median results in higher stability. We show that the selected data instances can be explained intuitively based on their acoustic properties and position within an utterance. Our work helps provide a deeper understanding of the strengths, weaknesses, and trade-offs of different data selection strategies for speech emotion recognition.
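As a rough sketch of what agreement-based selection can look like (not the paper's exact procedure; the agreement measure, selection sizes, and simulated posteriors below are assumptions), one can score each candidate (sub-)utterance by how closely two classifiers agree and then pick either instances near the median agreement or instances spread across the agreement range:

import numpy as np

def agreement(post_a, post_b):
    # Per-instance agreement between two classifiers' positive-class posteriors
    # (1 means identical scores, 0 means maximally different).
    return 1.0 - np.abs(post_a - post_b)

def select_near_median(agree, k):
    # Stability-oriented choice: the k instances whose agreement is closest to the median.
    return np.argsort(np.abs(agree - np.median(agree)))[:k]

def select_diverse(agree, k):
    # Diversity-oriented choice: k instances spread evenly across the agreement range.
    order = np.argsort(agree)
    return order[np.linspace(0, len(order) - 1, num=k).astype(int)]

# Hypothetical posteriors from two classifiers over 1000 unlabeled (sub-)utterances.
rng = np.random.default_rng(1)
agree = agreement(rng.random(1000), rng.random(1000))
print(select_near_median(agree, 10))
print(select_diverse(agree, 10))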
... Integrating information requires an in-depth investigation of the algorithmic metrics of behavior (say log likelihood of an empathetic statement) versus the impact this has on the coding process. In our recent work we briefly touched on this [15] in investigating how human annotators employ isolated saliency or causal integration to make their decisions. In this work we will employ our DBM to investigate whether equal weight should be given to every turn (Sec. ...
... In our work, we have used perceptual methods that use local information to derive the same global decisions. For example, in [15] we evaluated whether global behavior could be effectively judged based on a locally isolated, yet highly informative event or by integrating information over time. Similarly, in this work, the DBM employs two information-integration techniques, one that assumes that each talk-turn conveys exactly the same amount of information and should be counted as such, and one that employs probabilistic measures for accumulating behavioral beliefs to reach global decisions. ...
... Similar effects were reported by Li et al. [30] where they varied the receptive field of a 1-dimensional Convolutional Neural Network-based behavior classifier from 3.75 seconds to 60 seconds and found that classification improved for many behaviors, but mainly for Negative and Blame. Other efforts have addressed related aspects; for instance, Lee et al. [31] examined whether the behavior annotation process is driven more by a gradual, causal mechanism or by isolated salient events, which imply the use of long and short observation windows respectively. ...
Preprint
Full-text available
Automatic quantification of human interaction behaviors based on language information has been shown to be effective in psychotherapy research domains such as marital therapy and cancer care. Existing systems typically use a moving-window approach where the target behavior construct is first quantified based on observations inside a window, such as a fixed number of words or turns, and then integrated over all the windows in that interaction. Given a behavior of interest, it is important to employ the appropriate length of observation, since too short a window might not contain sufficient information. Unfortunately, the link between behavior and observation length for lexical cues has not been well studied and it is not clear how these requirements relate to the characteristics of the target behavior construct. Therefore, in this paper, we investigate how the choice of window length affects the efficacy of language-based behavior quantification, by analyzing (a) the similarity between system predictions and human expert assessments for the same behavior construct and (b) the consistency in relations between predictions of related behavior constructs. We apply our analysis to a large and diverse set of behavior codes that are used to annotate real-life interactions and find that behaviors related to negative affect can be quantified from just a few words whereas those related to positive traits and problem solving require much longer observation windows. On the other hand, constructs that describe dysphoric affect do not appear to be quantifiable from language information alone, regardless of how long they are observed. We compare our findings with related work on behavior quantification based on acoustic vocal cues as well as with prior work on thin slices and human personality predictions and find that, in general, they are in agreement.
... At test time, the output of the DNN system provides a score for the presence of the behavior (as in Fig. 4), but does not provide a global rating. While a range of methods exist for fusing decisions (e.g., [16,17,24]), in this work we will use the simplest one: average posteriors. We can treat the output of the DNN, q_i^k, as a proxy for the posterior probability of the behavior given frame i of session k, where L_k is the number of frames in session k. ...
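For concreteness, the average-posterior fusion described in this excerpt reduces to a mean over the frame-level outputs of a session. The sketch below assumes the frame posteriors are already available; the function name and example values are illustrative.

import numpy as np

def session_score(frame_posteriors):
    # Average-posterior fusion: treat each frame-level output q_i^k as a proxy for
    # P(behavior | frame i, session k) and average over the L_k frames of session k.
    q = np.asarray(frame_posteriors, dtype=float)
    return q.mean()

# Hypothetical frame-level posteriors for one session.
print(session_score([0.1, 0.7, 0.9, 0.4]))  # 0.525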
Article
Full-text available
Observational studies are based on accurate assessment of human state. A behavior recognition system that models interlocutors' states in real time can significantly aid the mental health domain. However, behavior recognition from speech remains a challenging task, since it is difficult to find generalizable and representative features in noisy, high-dimensional data, especially when data are limited and annotated coarsely and subjectively. Deep Neural Networks (DNNs) have shown promise in a wide range of machine learning tasks, but their application to Behavioral Signal Processing (BSP) tasks has been constrained by the limited quantity of data. We propose a Sparsely-Connected and Disjointly-Trained DNN (SD-DNN) framework to deal with limited data. First, we break the acoustic feature set into subsets and train multiple distinct classifiers. Then, the hidden layers of these classifiers become parts of a deeper network that integrates all feature streams. The overall system allows for full connectivity while limiting the number of parameters trained at any time, making convergence possible even with limited data. We present results on multiple behavior codes in the couples' therapy domain and demonstrate the benefits in behavior classification accuracy. We also show the viability of this system toward live behavior annotation.
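The staged idea described here (train sub-networks on feature subsets, then reuse their hidden layers inside a deeper integrating network) can be sketched roughly as follows. This is not the authors' exact SD-DNN configuration; the layer sizes, class names, and the omitted stage-wise training loop are assumptions.

import torch
import torch.nn as nn

class SubNet(nn.Module):
    # Stage 1: one classifier trained on a subset of the acoustic features.
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        return self.head(self.hidden(x))

class IntegratedNet(nn.Module):
    # Stage 2: the sub-networks' hidden layers become parts of a deeper network
    # that integrates all feature streams, with a small top network on top.
    def __init__(self, subnets):
        super().__init__()
        self.subnets = nn.ModuleList(subnets)
        total = sum(s.hidden[0].out_features for s in subnets)
        self.top = nn.Sequential(nn.Linear(total, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, feature_subsets):
        h = torch.cat([s.hidden(x) for s, x in zip(self.subnets, feature_subsets)], dim=-1)
        return self.top(h)

# Hypothetical use: three feature subsets of dimensions 40, 25, and 35.
subnets = [SubNet(d) for d in (40, 25, 35)]
model = IntegratedNet(subnets)
batch = [torch.randn(8, d) for d in (40, 25, 35)]
print(model(batch).shape)  # torch.Size([8, 1])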
... Future research could focus on sequential analysis of the psychologist's speech in order to gain more insights into the interaction dynamics between the child and the psychologist. For instance, it is of some interest to understand the point at which the psychologist makes a decision; this computation has been attempted in the couples therapy setting (Lee, Katsamanis, Georgiou, & Narayanan, 2012). Further, interaction processes such as prosodic entrainment can be computationally investigated in relation to expert-coded behaviors to lend deeper insights into underlying mechanisms (Lee et al., 2014). ...
Article
Purpose: The purpose of this study was to examine relationships between prosodic speech cues and autism spectrum disorder (ASD) severity, hypothesizing a mutually interactive relationship between the speech characteristics of the psychologist and the child. The authors objectively quantified acoustic-prosodic cues of the psychologist and of the child with ASD during spontaneous interaction, establishing a methodology for future large-sample analysis. Method: Speech acoustic-prosodic features were semiautomatically derived from segments of semistructured interviews (Autism Diagnostic Observation Schedule, ADOS; Lord, Rutter, DiLavore, & Risi, 1999; Lord et al., 2012) with 28 children who had previously been diagnosed with ASD. Prosody was quantified in terms of intonation, volume, rate, and voice quality. Research hypotheses were tested via correlation as well as hierarchical and predictive regression between ADOS severity and prosodic cues. Results: Automatically extracted speech features demonstrated prosodic characteristics of dyadic interactions. As rated ASD severity increased, both the psychologist and the child demonstrated effects for turn-end pitch slope, and both spoke with atypical voice quality. The psychologist's acoustic cues predicted the child's symptom severity better than did the child's acoustic cues. Conclusion: The psychologist, acting as evaluator and interlocutor, was shown to adjust his or her behavior in predictable ways based on the child's social-communicative impairments. The results support future study of speech prosody of both interaction partners during spontaneous conversation, while using automatic computational methods that allow for scalable analysis on much larger corpora.
Preprint
Full-text available
Couples generally manage chronic diseases together, and the management takes an emotional toll on both patients and their romantic partners. Consequently, recognizing the emotions of each partner in daily life could provide insight into their emotional well-being in chronic disease management. The emotions of partners are currently inferred in the lab and in daily life using self-reports, which are not practical for continuous emotion assessment, or observer reports, which are manual, time-intensive, and costly. Currently, there exists no comprehensive overview of works on emotion recognition among couples. Furthermore, approaches for emotion recognition among couples have (1) focused on English-speaking couples in the U.S., (2) used data collected from the lab, and (3) performed recognition using observer ratings rather than partners' self-reported / subjective emotions. In the body of work contained in this thesis (8 papers: 5 published and 3 currently under review in various journals), we fill the current literature gap on couples' emotion recognition, develop emotion recognition systems using 161 hours of data from a total of 1,051 individuals, and make contributions toward taking couples' emotion recognition from the lab, which is the status quo, to daily life. This thesis contributes toward building automated emotion recognition systems that would eventually enable partners to monitor their emotions in daily life and enable the delivery of interventions to improve their emotional well-being.
Article
The task of quantifying human behavior by observing interaction cues is an important and useful one across a range of domains in psychological research and practice. Machine learning-based approaches typically perform this task by first estimating behavior based on cues within an observation window, such as a fixed number of words, and then aggregating the behavior over all the windows in that interaction. The length of this window directly impacts the accuracy of estimation by controlling the amount of information being used. The exact link between window length and accuracy, however, has not been well studied, especially in spoken language. In this paper, we investigate this link and present an analysis framework that determines appropriate window lengths for the task of behavior estimation. Our proposed framework utilizes a two-pronged evaluation approach: (a) extrinsic similarity between machine predictions and human expert annotations, and (b) intrinsic consistency between intra-machine and intra-human behavior relations. We apply our analysis to real-life conversations that are annotated for a large and diverse set of behavior codes and examine the relation between the nature of a behavior and how long it should be observed. We find that behaviors describing negative and positive affect can be accurately estimated from short to medium-length expressions whereas behaviors related to problem-solving and dysphoria require much longer observations and are difficult to quantify from language alone. These findings are found to be generally consistent across different behavior modeling approaches.
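The extrinsic prong of such an evaluation can be sketched as computing, for each candidate window length, the Spearman correlation between session-level machine estimates and human expert codes. The window lengths, noise levels, and simulated scores below are assumptions for illustration, not results from the paper.

import numpy as np
from scipy.stats import spearmanr

def extrinsic_similarity(machine_by_window, human_codes):
    # For each window length, Spearman correlation between session-level
    # machine estimates and human expert annotations.
    out = {}
    for w, scores in machine_by_window.items():
        rho, _ = spearmanr(scores, human_codes)
        out[w] = rho
    return out

# Hypothetical estimates for 100 sessions at three window lengths (in words).
rng = np.random.default_rng(2)
human = rng.normal(size=100)
machine = {w: human + rng.normal(scale=s, size=100)
           for w, s in [(10, 1.0), (50, 0.5), (200, 0.25)]}
print(extrinsic_similarity(machine, human))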
Article
The ability to accurately judge another person's emotional state from a short duration of observation is a unique perceptual mechanism of humans, termed thin-sliced judgment. In this work, we propose a computational framework based on mutual information to identify the thin-sliced, emotion-rich behavior segments within each session and further use these segments to train session-level affect regressors. Our proposed thin-sliced framework obtains regression accuracies, measured in Spearman correlation, of 0.605, 0.633, and 0.672 on the session-level attributes of activation, dominance, and valence, respectively. It outperforms a baseline framework that uses data from the entire session. The significant improvement in the regression correlations reinforces the thin-sliced nature of human emotion perception. By properly extracting these emotion-rich behavior segments, we not only obtain improved overall accuracy but also gain additional insights. Specifically, our detailed analyses indicate that this thin-sliced nature of emotion perception is more evident for the attributes of activation and valence, and that the within-session distribution of emotion-salient behavior is concentrated more toward the ending portion of the session. Lastly, we observe that there indeed exists a certain set of behavior types that carry high emotion-related content, and this is especially apparent at the extreme emotion levels.
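A minimal sketch of the thin-slicing idea, assuming per-segment salience scores (for example, mutual-information-based, as described above) are already computed, is to keep only the highest-scoring segments of a session and pool their features before training the session-level regressor. The fraction kept and the simulated features are assumptions.

import numpy as np

def thin_slice_pool(segment_features, salience, keep_frac=0.2):
    # Keep the most emotion-rich segments (highest salience) and mean-pool
    # their features into a single session-level representation.
    k = max(1, int(round(keep_frac * len(salience))))
    top = np.argsort(salience)[-k:]
    return np.asarray(segment_features)[top].mean(axis=0)

# Hypothetical session: 30 segments with 12-dimensional features and precomputed salience.
rng = np.random.default_rng(3)
feats = rng.normal(size=(30, 12))
sal = rng.random(30)
print(thin_slice_pool(feats, sal).shape)  # (12,)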
Conference Paper
Emotion recognition is the process of identifying the affective characteristics of an utterance given either static or dynamic descriptions of its signal content. This requires the use of units, windows over which the emotion variation is quantified. However, the appropriate time scale for these units is still an open question. Traditionally, emotion recognition systems have relied upon units of fixed length, whose variation is then modeled over time. This paper takes the view that emotion is expressed over units of variable length. In this paper, variable-length units are introduced and used to capture the local dynamics of emotion at the sub-utterance scale. The results demonstrate that subsets of these local dynamics are salient with respect to emotion class. These salient units provide insight into the natural variation in emotional speech and can be used in a classification framework to achieve performance comparable to the state-of-the-art. This hints at the existence of building blocks that may underlie natural human emotional communication.
Conference Paper
Psychology is often grounded in observational studies of human interaction behavior, and hence in human perception and judgment. There are many practical and theoretical challenges in observational practice. Technology holds the promise of mitigating some of these difficulties by assisting in the evaluation of higher-level human behavior. In this work we attempt to address two questions: (1) Does the lexical channel contain the necessary information for such an evaluation? And if so, (2) can such information be captured by a noisy automated transcription process? We utilize a large corpus of couple interaction data collected in the context of a longitudinal study of couple therapy. In the original study, each spouse was manually evaluated with several session-level behavioral codes (e.g., level of acceptance toward the other spouse). Our results show that both of our research questions can be answered positively and encourage future research into such assistive observational technologies.