HONEST SIGNALS IN VIDEO CONFERENCING
Byungki Byun1,2, Anurag Awasthi1,3, Philip A. Chou1, Ashish Kapoor1, Bongshin Lee1, Mary Czerwinski1
1Microsoft Research
Redmond, WA, 98052, USA
{pachou, akapoor, bongshin, marycz}@microsoft.com
2School of ECE
Georgia Institute of Technology
Atlanta, GA, 30332, USA
yorke3@ece.gatech.edu
3Department of CSE
Indian Institute of Technology Kanpur
208016, India
anuraga@cse.iitk.ac.in
ABSTRACT
We propose a novel system to analyze gestural and non-
verbal cues of participants in video conferencing. These
cues have previously been referred to as "honest signals"
and are usually associated with the underlying cognitive
state of the participants. The presented system analyzes a set
of audio-visual, non-linguistic features in real time from the
audio and video streams of two participants in a video
conference. We show how these features can be used to
compute indicators of the overall quality and type of
conversation being held. The system also provides visual
feedback to the participants, who then have the choice of
modifying their conversational style in order to achieve the
desired outcome of the video conference. Experiments on
real-life data show that the system can predict the type of
conversation with high accuracy using the non-linguistic
signals only. Qualitative user studies highlight the positive
effects of increased awareness amongst the participants
about their own gestural and non-verbal cues.
Index Terms: Non-verbal behavior, gesture analysis,
video conference, honest signals.
1. INTRODUCTION
All animals have evolved to communicate with each other
using non-linguistic signals. For example, they have been
observed to use physical appearance, movement, and/or
vocalization to convey dominance, cooperation, warning,
trust, and so forth.
In recent years, the rapid increase in the capacity of
digital communication has enabled humans to invent many
ways of communicating with each other. For humans,
linguistic means have always been dominant even in these
new communication methods. However, as animals, humans
have retained the ability to convey and to utilize non-
linguistic signals during communication, including during
video conferences.
Non-linguistic signals used for communication are
termed "honest signals" if, in addition to communicating
information intended by the sending animal, they also carry
cues about the sender's underlying state (e.g., emotion,
fatigue, or confusion) that are difficult for the sender to hide.
Since these cues are detectable by the receiving animal(s),
there is a good chance that they can also be detected by
sensing devices coupled to computers.
One approach to detecting and analyzing such non-
linguistic signals is to have the group of people, whose
honest signals are to be sensed, wear electronic badges
draped around their necks [1][2]. Data collected in such a
way has been the subject of numerous experiments and has
been shown to be predictive of the outcomes of a variety of
social interactions, including speed dating, elevator pitches,
sales, salary negotiation, and card games. However, wearing
a badge or any other type of device is intrusive and is
unlikely to be adopted by real users.
Other approaches have focused on the analysis of facial
expressions, utterances, postures and physiology for cues on
underlying internal state [3-8]. There is also a body of
related work on human interaction analysis. For example,
[9][10] show that interactions between people in a meeting
can be analyzed by tracking non-verbal cues, and [11][12]
demonstrate that visualizing the amount of time attendees
are talking during a meeting influences people's behavior.
In this work, we aim to sense and analyze honest
signals in the context of 1-1 video conferencing. From
audiovisual signals recorded by microphones and video
cameras, we non-intrusively extract low-level audio and
video features to characterize honest signals and social roles,
and to present diagnostics of the signals back to the users
through a graphical interface, all in real time. Our
hypothesis is that such visualization of users' nonverbal
behavior can help the users modify their actions and make
conversations more successful.
There are three main contributions of this work:
- We explore and analyze a spectrum of gestural behaviors and non-verbal cues that are important in 1-1 video teleconferencing.
- We show how to build modules that can predict key aspects of the ongoing conference, such as behaviors indicative of the quality and type of conversation, from the low-level features extracted from non-verbal behaviors and gestures.
- We present a detailed user study and experimentally show that such a predictive framework works well and has the capability to positively influence the conversations in the video conference.
2. HONEST SIGNALS AND SOCIAL ROLES
For completeness, we briefly explain four honest signals
considered in this work, originally studied in [2]. They are
(a) activity, (b) consistency, (c) influence, and (d) mimicry.
Activity refers to the energetic state of a person, reflecting
his or her degree of involvement in a conversation.
Consistency refers to the degree of regularity or cadence of
behavior, primarily during speech, reflecting mental
certainty. Influence refers to the degree of control of one person
over the conversation, reflecting interest or desire to
dominate. Mimicry refers to a behavioral pattern that mimics
the behavior of others, reflecting agreement or empathy.
These honest signals are shown to be predictive of the
outcomes of a variety of social interactions. Moreover, these
signals are also predictive of social roles. Four particular
social roles examined in [2] are exploring, active listening,
teaming, and leading. Relating honest signals with the social
roles, exploring (e.g., looking for points in common with
another person) is indicated by high activity and low
consistency. Active listening (e.g., listening and reflecting
information back to the talker) is indicated by low activity
and low consistency, and teaming (i.e., working together in
a team, showing cooperative behavior) is indicated by high
influence, high mimicry, and high consistency. Finally,
leading (e.g., dominance or control in a group) is indicated
by high activity, high influence, and high consistency.
3. PRELIMINARY STUDY
To test the hypothesis that real time visual presentation of
honest signals would influence participants' behaviors
during a video conference, we first ran a wizard-of-oz user
study in the laboratory bringing in pairs of users and giving
them two controversial conversational topics to discuss (e.g.,
do you prefer Apple or generic PC products?). Each user
was to argue for a particular side of an argument to try to
influence the other participant via a video conferencing tool.
For one conversational topic, no visual representation of
their behaviors was included, and for the other, we
presented to both users a visual user interface showing the
levels of their own honest signals as in Fig. 1. In this
preliminary study, we showed "Excitement" as a proxy for
activity, "Openness" as a proxy for the converse of
consistency and/or influence, and "Agreement" as a proxy
for mimicry. We also showed a speaking timeline and
overall speaking proportion.
In this study, the signal levels were actually controlled
by two researchers, who were adjusting the levels in real
time as the conversation unfolded. After each condition, we
asked the users to rate the system usefulness, features they
used, if any, and behavioral changes if any. Users were
satisfied overall with the visual feedback of their behavior.
They especially appreciated seeing the amount of time they
spent talking compared to the other user. If they saw that
they were dominating the discussion, they said they tried to
give the other user more opportunity to speak. While some
of the user interface elements we chose to display were
confusing to participants, positive comments indicated the
utility of the information. For example:
"I looked at the time mostly though but it gave me an idea of how our speaking time was shared. I am not sure why but I felt more in control of the video conversation."

"Some of it [was useful]. To be honest, I'm not really sure what openness is referring too. Seems vague ... excitement and agreement are good. Using the timeline more than percentage, would be nice to see them together."

"It certainly helped moderate the discussion. I also think it might keep meetings from wandering -- given that the time elapsed is shown."
Thus, we refined the user interface, moving forward to
develop a system that could give honest signal-style
feedback to users in real time.
4. SYSTEM OVERVIEW
Here, we provide an overview of the real-time system
developed. As illustrated in Fig. 2, a pair of clients
establishes an ordinary video conference between them, and
starts to stream audio and video data to each other. As soon
as the video conference is started, each client automatically
establishes a TCP connection to an honest signal server. The
server listens for these TCP connections arriving from
various clients, matches the connections coming from a
communicating pair, and spawns a process to handle each
pair of clients in conversation.
The clients intercept their own audio and video signals,
extract low-level features (e.g., pitch period, visual motion)
from each frame of audio and video, and transmit the low-
level features over their connections to the server. The TCP
connection carries low bandwidth, constant bit rate
information from client to server using about 5 kbps: about
12 bytes for every 30 ms for audio and about 60 bytes for
every 67 ms for video.

Fig. 1. The graphical user interface showing conversation feedback used in a wizard-of-oz study. The right panel shows the signal levels, speaking timeline and proportion.

The server process synchronizes the streams of the low-level features from the two clients,
evaluates additional intermediate-level features from these
low-level features, estimates the level of the four honest
signals, and feeds honest signal estimates back to its
respective clients for presentation to the end-users in real
time. Transmitting the estimates of the honest signals
requires only about 160 bps from the server to each client.
The connections are maintained until the end of conference.
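To make the client-to-server link concrete, the following minimal Python sketch packs one audio feature frame and one video feature frame into small fixed-size messages and writes them over a TCP connection. The field layout, message tags, and server address are assumptions for illustration; the paper specifies only the approximate payload sizes and frame intervals.

```python
import socket
import struct

# Hypothetical fixed-size payloads, roughly matching the sizes quoted above
# (about 12 bytes per 30 ms audio frame, about 60 bytes per 67 ms video frame).
AUDIO_FMT = "<Bfff"    # tag, pitch (Hz), voice activity (0/1), spectral distance
VIDEO_FMT = "<B15f"    # tag, the 15 low-level video features

def pack_audio_frame(pitch_hz, voice_activity, spectral_distance):
    """Serialize one 30 ms audio feature frame (hypothetical layout)."""
    return struct.pack(AUDIO_FMT, 0x01, pitch_hz, float(voice_activity), spectral_distance)

def pack_video_frame(features):
    """Serialize one 67 ms video feature frame (hypothetical layout)."""
    assert len(features) == 15
    return struct.pack(VIDEO_FMT, 0x02, *features)

def open_feature_link(server_host, port=5000):
    """Open the client's TCP connection to the honest signal server."""
    return socket.create_connection((server_host, port))

# Usage, assuming a server is listening at the (hypothetical) address:
# sock = open_feature_link("honest-signal-server.example")
# sock.sendall(pack_audio_frame(210.0, 1, 0.12))
# sock.sendall(pack_video_frame([0.0] * 15))
```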
The key computation at the server is real-time
estimation of the users' honest signals from the low-level
features of each pair of communicating clients. In particular,
at regular intervals (once per second in this work), the low-
level features are first processed into a vector of
intermediate-level features. Next, these intermediate-level
features are fed into four logistic-regression models trained
a priori on data collected as described in Section 6. Each of
the logistic regression models estimates the probability that
the corresponding honest signal is in high state. These
probabilities are then sent to the respective client for
presentation.
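The per-second estimation step can be sketched as follows: for each of the four honest signals, a trained logistic-regression model is applied to the current intermediate-level feature vector to produce the probability of the high state. The weights below are random placeholders standing in for the models trained in Section 6.

```python
import numpy as np

SIGNALS = ("activity", "consistency", "influence", "mimicry")

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def estimate_honest_signals(features, weights, biases):
    """Return P(signal is in the high state) for each honest signal.

    features: intermediate-level feature vector for the current one-second interval
    weights:  dict mapping signal name -> trained weight vector
    biases:   dict mapping signal name -> trained scalar bias
    """
    x = np.asarray(features, dtype=float)
    return {s: float(sigmoid(weights[s] @ x + biases[s])) for s in SIGNALS}

# Toy usage with placeholder parameters (26-dimensional features, as in Section 5):
rng = np.random.default_rng(0)
weights = {s: 0.1 * rng.normal(size=26) for s in SIGNALS}
biases = {s: 0.0 for s in SIGNALS}
print(estimate_honest_signals(rng.normal(size=26), weights, biases))
```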
Note that the system can be extended easily to multi-
party conversations as computation of the intermediate-level
features and estimation of the honest signals are all done at
the server.
5. NON-LINGUISTIC FEATURES
In this section, we describe the non-linguistic features used.
The low-level features extracted at each client primarily
capture individual behavioral patterns. In contrast, the
intermediate-level features computed at the server based on
the low-level features describe not only individual
behavioral patterns but also mutual interactions. We
handcrafted the features to be indicative of the four honest
signals, which are embedded in participants' behavioral
patterns and mutual interactions captured by the audio and
video. Later, the logistic regression models that estimate the
honest signals from these features are trained from data
collected in our user studies.
5.1. Low-level features
The low-level features are captured at 30 ms frame intervals
for audio and 67 ms frame intervals (15 fps) for video. For
audio, three features are computed per frame: pitch, voice
activity, and spectral distance. For pitch, a pitch tracker
performs linear predictive coding (LPC) [13] and uses
dynamic programming to estimate the pitch (or zero if the
frame is unvoiced or silent). For voice activity detection, we
use the power spectrum of a frame at time t after identifying
voiced frames with the pitch tracker. In particular, voice
activity is detected when the total power spectrum of the
frame at time t is larger than a threshold and its neighboring
frames are determined as voiced by the pitch tracker. Finally,
for spectral distance, we compute the Itakura distortion [14]
from the LPC coefficients between two consecutive frames.
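A compact sketch of this audio front end is shown below: LPC coefficients via the Levinson-Durbin recursion, a crude energy-plus-voicing activity decision, and the Itakura distortion between the LPC models of two consecutive frames. The analysis order, threshold, and windowing are illustrative choices, and the pitch tracker itself (LPC analysis plus dynamic programming) is not reproduced here.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation r[0..order] of one audio frame."""
    return np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(order + 1)])

def lpc(frame, order=12):
    """LPC coefficients [1, a1, ..., ap] via the Levinson-Durbin recursion."""
    r = autocorr(frame, order)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def voice_active(frame, pitch_hz, power_threshold=1e-4):
    """Crude voice-activity decision: voiced pitch plus sufficient frame power."""
    return pitch_hz > 0 and float(np.mean(frame ** 2)) > power_threshold

def itakura_distortion(prev_frame, frame, order=12):
    """Itakura distortion [14] between the LPC models of two consecutive frames."""
    a_prev, a_cur = lpc(prev_frame, order), lpc(frame, order)
    r = autocorr(frame, order)
    R = np.array([[r[abs(i - j)] for j in range(order + 1)] for i in range(order + 1)])
    return float(np.log((a_prev @ R @ a_prev) / (a_cur @ R @ a_cur)))

# Toy usage on two noisy synthetic 30 ms frames at 16 kHz:
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
rng = np.random.default_rng(0)
f1 = np.sin(2 * np.pi * 200 * t) + 0.01 * rng.standard_normal(t.size)
f2 = np.sin(2 * np.pi * 210 * t) + 0.01 * rng.standard_normal(t.size)
print(itakura_distortion(f1, f2))
```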
For video, 15 features are computed per frame. A face
tracker [15] provides one rectangular region where a face is
detected. The magnitude of the motion of the center of the
rectangle between frames is one feature. Two other features
are the average magnitude of the motion vectors inside the
facial region and the average magnitude of the motion
vectors outside the facial region, where the motion vectors
are computed through an optical flow computation between
consecutive frames. Twelve outputs of a Gabor filter-bank
with two parameters, scale and orientation, to account for
abrupt changes of facial expressions [16] are also extracted
from the facial region for future work.
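A rough sketch of the per-frame video features is given below, using OpenCV's stock Haar-cascade face detector as a stand-in for the tracker of [15] and Farneback dense optical flow for the motion features; the Gabor filter-bank outputs are omitted and all parameter values are illustrative.

```python
import cv2
import numpy as np

# Stand-in face detector (the system uses the face tracker of [15]).
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def video_frame_features(prev_gray, gray, prev_center):
    """Face-center motion and mean flow magnitude inside/outside the face region."""
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    center = np.array([x + w / 2.0, y + h / 2.0])
    center_motion = 0.0 if prev_center is None else float(np.linalg.norm(center - prev_center))

    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    face_mask = np.zeros(mag.shape, dtype=bool)
    face_mask[y:y + h, x:x + w] = True
    inside = float(mag[face_mask].mean())
    outside = float(mag[~face_mask].mean()) if (~face_mask).any() else 0.0
    return center_motion, inside, outside, center

# Typical use inside a 15 fps capture loop (sketch):
# cap = cv2.VideoCapture(0)
# prev_gray, prev_center = None, None
# while cap.isOpened():
#     ok, frame = cap.read()
#     if not ok:
#         break
#     gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
#     if prev_gray is not None:
#         feats = video_frame_features(prev_gray, gray, prev_center)
#         if feats is not None:
#             prev_center = feats[3]
#     prev_gray = gray
```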
5.2. Intermediate-level features
The intermediate-level features are computed at regular
intervals (once per second in our experiments) for both audio
and video. For each participant, a vector of intermediate-
level features is computed based on the low-level features of
both participants since the intermediate-level features may
need to capture behavioral interactions between participants.
One of the distinct behavioral interactions in a
conversation is turn-taking. At one extreme, a single person
can dominate a conversation without giving a chance for
anyone else to talk, while at the other extreme, a person can
never talk but always listen. We define three unique turn-
taking patterns in this work: barge-in, grant-floor, and
suppression, as illustrated in Fig. 3.
In Fig. 3, a high value means a person is talking and a
low value means the person is silent. Let Person 2 be an
actor who wants to control turn-taking and let Person 1 be a
member of the audience who gets influenced by the actor.
Then, in Fig. 3, from left to right, Person 2 barges in on 1, suppresses 1, and finally grants the floor to 1.

Fig. 2. A block diagram of the real time system.

Fig. 3. Example timeline of turn taking between Persons 1 (audience) and 2 (actor). These turn taking patterns can be used to analyze the properties of a conversation.

Specifically,
by starting to talk, barge-in forces the other person to stop
talking and by continuing to talk, suppression prevents the
other person from taking the floor. Finally, in contrast to
barge-in, by ceasing to talk, grant-floor allows the other
person to start talking.
To extract these turn-taking patterns, we use voice
activity detection to measure when someone starts and stops
talking. Given continuous segments of voice activity along
the time axis, let $s_m^{(1)}$ and $s_n^{(2)}$ be the start times of the segments, and let $e_m^{(1)}$ and $e_n^{(2)}$ be the termination times of the segments for Persons 1 and 2, respectively. Then, from the perspective of Person 2, the barge-in, grant-floor, and suppression counts $B_n$, $G_n$, and $S_n$ for its $n$-th segment can be defined (where $I(\cdot)$ is an indicator function) as

$B_n = \sum_m I\!\left(s_m^{(1)} < s_n^{(2)} < e_m^{(1)} \le e_n^{(2)}\right),$

$G_n = \sum_m I\!\left(e_n^{(2)} < s_m^{(1)} < s_{n+1}^{(2)}\right),$

$S_n = \sum_m I\!\left(s_n^{(2)} < s_m^{(1)} < e_m^{(1)} < e_n^{(2)}\right),$

where $s_{n+1}^{(2)}$ denotes the start of Person 2's next segment.
The number of barge-in, grant-floor, and suppression events
within the last To (set to 30) seconds, and the average of
their durations within the last To seconds are the first six
intermediate-level features for Person 2. Intermediate-level
features for Person 1 are similarly defined.
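A minimal sketch of how these events could be counted from the two persons' voice-activity segments is given below, following the reconstruction above; the exact pairing rules used in the system may differ. Segments are (start, end) pairs in seconds, sorted by start time, with Person 2 as the actor.

```python
def turn_taking_events(actor_segments, audience_segments):
    """Collect barge-in, grant-floor, and suppression intervals for the actor (Person 2)."""
    barge_in, grant_floor, suppression = [], [], []
    for n, (s2, e2) in enumerate(actor_segments):
        next_s2 = actor_segments[n + 1][0] if n + 1 < len(actor_segments) else float("inf")
        for s1, e1 in audience_segments:
            if s1 < s2 < e1 <= e2:
                barge_in.append((s2, e1))       # actor starts while audience talks; audience stops first
            elif s2 < s1 < e2 and e1 < e2:
                suppression.append((s1, e1))    # audience attempt falls inside the actor's turn
            elif e2 < s1 < next_s2:
                grant_floor.append((e2, s1))    # audience takes the floor after the actor stops
    return barge_in, grant_floor, suppression

def count_and_mean_duration(events):
    """The two intermediate-level features per event type within the last To window."""
    if not events:
        return 0, 0.0
    durations = [end - start for start, end in events]
    return len(events), sum(durations) / len(durations)

# Toy example: the actor barges in at t=5 s and grants the floor at t=9 s.
actor = [(5.0, 9.0)]
audience = [(2.0, 6.0), (10.0, 12.0)]
print(turn_taking_events(actor, audience))
```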
Four other intermediate-level features are extracted
from prosody: (a) average speaking rate, (b) pitch variation,
(c) syllabic rate variation, and (d) spectral distance variation.
As studied in [17], prosody is closely related to the
emotional state of a speaker. For example, if one is
overwhelmed by questions, the person will more likely
exhibit high variation in prosody than when the person is in
a normal emotional state. Among these features, average
speaking rate is the simplest to compute: the percentage of
active frames over the last To seconds.
To compute variations in pitch, syllabic rate and
spectral distance, we first compute their averages over the
previous To seconds, as well as over every to (set to 3)
second within the To seconds. Syllabic rate is determined by
counting the number of voiced/unvoiced transitions per
second. We then compute the squared variation of these to-
second averages around the To-second average. The
variation in syllabic rate indicates changes in the speed of
utterances while the variation in spectral distance should
capture informative events in utterances such as exclamation.
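These windowed statistics can be computed directly from the per-frame values, as in the sketch below: the average speaking rate over the last To seconds, the syllabic rate from voiced/unvoiced transitions, and the squared variation of the to-second averages around the To-second average. The 30 ms frame rate follows Section 5.1; the buffering details are illustrative.

```python
import numpy as np

FRAMES_PER_SEC = 1000 // 30   # ~33 audio frames per second (30 ms frames)
T_LONG, T_SHORT = 30, 3       # To and to, in seconds

def average_speaking_rate(voice_activity_flags, fps=FRAMES_PER_SEC, T_o=T_LONG):
    """Percentage of voice-active frames over the last To seconds."""
    flags = np.asarray(voice_activity_flags, dtype=float)[-T_o * fps:]
    return float(100.0 * flags.mean())

def syllabic_rate(voiced_flags, fps=FRAMES_PER_SEC):
    """Voiced/unvoiced transitions per second over the given frames."""
    v = np.asarray(voiced_flags, dtype=int)
    return float(np.abs(np.diff(v)).sum() * fps / len(v))

def windowed_variation(values, fps=FRAMES_PER_SEC, T_o=T_LONG, t_o=T_SHORT):
    """Squared variation of the to-second averages around the To-second average."""
    values = np.asarray(values, dtype=float)[-T_o * fps:]
    long_avg = values.mean()
    sub_len = t_o * fps
    n_sub = len(values) // sub_len
    sub_avgs = values[: n_sub * sub_len].reshape(n_sub, sub_len).mean(axis=1)
    return float(np.mean((sub_avgs - long_avg) ** 2))

# Example: variation of pitch over a buffer of per-frame pitch estimates (Hz).
rng = np.random.default_rng(0)
pitch_buffer = 180 + 20 * rng.standard_normal(T_LONG * FRAMES_PER_SEC)
print(windowed_variation(pitch_buffer))
```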
Additionally, to model variation of intonation in speech,
two autoregressive models of orders Q and q are also
trained given continuous segments of the estimated pitch
frequencies over the last To and to seconds. If there are
multiple such segments, segments that are longer than a
certain threshold are selected and concatenated to train Rt
and rt, the models at time t for To and to, respectively. Given
Rt and rt, we then measure the normalized squared residual
errors of the pitch frequency for the interval of [t, t+1],
creating our next two features.
Furthermore, we compute the average magnitude and
variation in head motion rate from the low-level motion
vectors and location of the tracked facial regions as our next
two features. Like prosody, physical motion is closely
associated with a person's internal state, especially the level
of excitement.
Besides the 14 intermediate-level features described above, we also compute 12 additional intermediate-level features involving means and variances of pitch, syllabic rate, and spectral distance over both the To and to windows, which completes the set of 26 features used in our experiments.
For future exploration, we also implement head-nod
and shake detectors [18] to identify higher-level cues of the
emotional state of a speaker using Hidden Markov Models
(HMMs) with two hidden states and eight Gaussian mixtures, taking the motion vectors extracted from the facial region as the observations.
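One way such detectors could be realized is with the hmmlearn package, training one two-state, eight-mixture HMM per gesture on sequences of facial-region motion vectors and classifying a new sequence by log-likelihood; the use of hmmlearn, the training data, and the classification rule are assumptions beyond the configuration stated above.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_gesture_model(sequences):
    """Train a 2-state, 8-mixture HMM on a list of motion-vector feature sequences."""
    X = np.vstack(sequences)
    lengths = [len(seq) for seq in sequences]
    model = GMMHMM(n_components=2, n_mix=8, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify_head_gesture(sequence, nod_model, shake_model):
    """Label a motion-vector sequence as a nod or a shake by HMM log-likelihood."""
    seq = np.asarray(sequence, dtype=float)
    return "nod" if nod_model.score(seq) >= shake_model.score(seq) else "shake"
```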
6. EXPERIMENTAL RESULTS
First, we conducted a user study to collect real-world data to
train and test our estimators of the honest signals. We then
tackled the problem of identifying conversation type to
explore predictive power of non-linguistic cues. A second
user study with different participants was later conducted
using the learned estimators to verify the hypothesis
regarding effects of real-time visualization of the honest
signals in video conferencing.
6.1. Data collection
In the first user study, we recruited 20 people, forming 10
conversation pairs. Each pair was asked to role play in
five types of conversations that included salary negotiation,
expounding upon or listening to a personal opinion on a
controversial topic, expounding upon or listening to a
personal dilemma, exploring what is in common, and
brainstorming. Two experimenters monitored the
conversations and annotated key events on the time line.
Such key events consisted of episodes of high/low activity,
high/low influence, high/low consistency, and high mimicry.
Low mimicry was not explicitly labeled because suitable
examples of low mimicry could be chosen arbitrarily. To
ensure quality in labeling, each experimenter monitored
only one participant.
The length of each conversation was seven to eight
minutes resulting in about 350 to 400 minutes of
conversation in total (5 conversations from each of 10 pairs
of participants). In the end, this exercise resulted in a data
set consisting of a collection of instances of varying lengths
labeled for high and low levels of each signal during each
conversation. The average length of such instances was 11
seconds across all conversations. The episodes for activity
and mimicry were the shortest, from three to ten seconds,
while episodes indicating influence were the longest and
were as much as 30 seconds in duration. These instances
were then used to train and test the predictors described in
the next sections. The dataset is described in Table 1.
Fig. 4. Average normalized accuracy of predicting the honest signals while varying the set of features used.

Table 1. Description of the collected dataset. The last row contains the total instances / total durations of time slices.

           Activity             Consistency           Influence             Mimicry
           High       Low       High       Low        High       Low        High
           99/1951s   21/673s   40/574s    60/1140s   48/391s    41/356s    48/211s
6.2. Predicting honest signals
Next, we built predictors for the honest signals. For each
second of labeled data, we assembled a vector of
intermediate-level features as described in the previous
section. On these features, we performed binary (i.e., high
vs. low) classification using linear logistic regression for all
four honest signals. The episodes of low influence were
used as a proxy for episodes of low mimicry. Also, we used
a leave-one-out strategy, where training was first performed
on instances from nine conversation pairs and tested on the
instances from the tenth. The accuracies were averaged over the
complete cycle of leave-one-out and thus reflect predictive
power independent of participants in the conversation.
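The evaluation protocol can be sketched with scikit-learn as a leave-one-pair-out loop: for each honest signal, train a binary logistic-regression classifier on the instances from nine conversation pairs, test on the held-out pair, and average the accuracies. Variable names and the data layout are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def leave_one_pair_out_accuracy(X, y, pair_ids):
    """Average accuracy of binary logistic regression with one conversation pair held out.

    X: (n_seconds, n_features) intermediate-level feature vectors
    y: (n_seconds,) high/low labels for one honest signal
    pair_ids: (n_seconds,) conversation-pair id for each labeled second
    """
    accuracies = []
    for held_out in np.unique(pair_ids):
        train, test = pair_ids != held_out, pair_ids == held_out
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        accuracies.append(accuracy_score(y[test], clf.predict(X[test])))
    return float(np.mean(accuracies))
```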
Also note that while activity and consistency are
personal traits, influence and mimicry reflect the interaction
between people. Consequently, we categorized a subset of
the extracted features into two groups as shown in Table 2.
Table 2. The 11 features selected to estimate the honest signals.

Category I                        Category II
Average speaking rate             Average length of barge-in
Variation of pitch                Average length of grant-floor
Variation of syllabic rate        Average length of suppression
Variation of spectral distance    Frequency of barge-in
Average head motion rate          Frequency of grant-floor
                                  Frequency of suppression
The features in Category I are personal properties, while the
features in Category II are properties of how one person
interacts with another. Features in both categories can be
computed for both sides of a conversation. We hypothesized
that determining a person's influence and mimicry depends
crucially on joint features from both sides, while activity
and consistency depend only on one-sided personal features.
To test our hypothesis, we constructed four different
systems. First, we considered two baseline systems that used
all 26 features described in Section 5 (including all means
and variances, a superset of Table 2) but differed in whether
the features came from one participant or from both participants.
Then, we had another system that included all the features
mentioned in Table 2 (Categories I and II) from both
participants (11 from each, for a total of 22 dimensions).
Finally, we designed a custom system where we included
Category I and II features only from one side for activity
and consistency, from both sides for mimicry, and just the
Category II features along with speaking rate for influence.
In sum, we had the following setups:
System I: All 26 features from both sides (total 52 dim.)
System II: All 26 features from one side (total 26 dim.)
System III: 11 features from both sides (total 22 dim.)
System IV: Custom as described above.
Fig. 4 presents the average recognition results of the
above four systems. We observe that the accuracies obtained
for activity and consistency are higher for System II than I,
highlighting that those two signals are more dependent on
individual traits as opposed to the joint features. The same
tendency is observed when comparing System III with IV.
On the other hand, comparing System I vs. II, better
accuracies are obtained for influence and mimicry in System
I, which verifies our hypothesis about these two honest
signals. Furthermore, the custom system, which matches the feature set to the personal or joint nature of each signal, performs best, demonstrating the significant predictive power of the features. In
particular, we see a significant improvement in accuracy for
activity and consistency from System III to IV (from 69% to
92% and from 65% to 77%, respectively). Moreover, the
improvement in accuracy for influence from System III to
System IV (67% to 76%) reflects that turn-taking features
are fairly important for predicting influence. Overall, the
results indicate that the non-linguistic signals have enough
discriminative power to predict the four dimensions
characterizing the honest signals. These honest signals thus
can be estimated from non-verbal behavior and presented
back to users while video conferencing.
6.3. Conversation type prediction
We also explored whether non-linguistic cues have enough
predictive power to perform conversation type prediction.
We first categorized the five conversations that took place
during the first user study into four conversation types: (a)
negotiation (e.g., salary negotiation), (b) active listening
(listening to personal opinions and dilemmas), (c) exploring
(exploring what is in common) and (d) brainstorming. We
then used a one-vs.-all formulation of logistic regression
with the entire set of extracted features (System I). Table 3
summarizes the performances.
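An illustrative scikit-learn realization of this one-vs.-all formulation over the System I features is sketched below; it is not the exact training code used in the study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

CONVERSATION_TYPES = ("negotiation", "active listening", "exploring", "brainstorming")

def train_conversation_type_classifier(X, y):
    """One-vs.-all logistic regression over the full feature set (System I, both sides)."""
    return OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Usage: predictions = train_conversation_type_classifier(X_train, y_train).predict(X_test)
```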
We observed an average accuracy of 77% across all the
conversation classes. The system performed best in
predicting the conversation class of "Exploring" with 83%
accuracy, while "Listening" achieved the worst performance
with 72% accuracy. Note that our features do not encode
any semantics or content of the conversation. This suggests
that non-verbal cues can provide important information
about the conversation type, and thus might eventually be useful in predicting the final success of the video conference.
Table 3. Accuracy for conversation type prediction.

Conversation Type     Accuracy (%)
Negotiation           76
Active listening      72
Exploring             83
Brainstorming         77
Average               77
6.4. User perception and survey
In the second user study, after participating in conversations
of the five different types and playing different roles,
participants were queried as to how valuable they found the
new system, which features they liked or did not find
valuable, and whether or not such a tool would influence
their conversations. While users varied in the degree to which
they were able to use the feedback in real time, most agreed
that the system could in fact provide value for them during
video conferences. Most participants found the speaking
time and proportion to be extremely useful, and said that the
system did alter their behaviors, whether they used the
system as a “coach” to speak less, or as a “prompt” to speak
more. Some participants really liked the consistency and
influence signals, but many participants wished they could
see the signals for the other person as well, to compare how
they were doing in reference to their partner. Participants
also asked for a cumulative view of how they had been
behaving, not just a summary of the last five minutes, for
retrospective analysis. A couple of participants liked the
activity feedback, with one participant even stating that it
would be very useful for her child with attention deficit
disorder (ADD) when using Skype. Overall, the system was
well received, but it was clear that we needed to iterate on
the user interface to make it less distracting, more
glanceable, and cumulative for the whole session.
7. CONCLUSION AND FUTURE WORK
We set out to explore whether or not there was viability in
building a system that could automatically analyze non-
verbal signals for video conference participants, enhance
their conversations with appropriate feedback, and also
provide predictive information about characteristics of the
conversation. An early wizard-of-oz user study provided
initial evidence that the concept had merit, but that the
feedback user interface needed iteration. We built the
system and, using real user data, were able to show that we
could not only accurately assess the nonverbal signals we
were interested in, but we could also make predictions about
meeting types and user roles, and that users found the
feedback valuable for modulating their video conference
conversation. To our knowledge, this is the first investigation of these kinds of predictions and evaluations performed automatically on actual audio and visual signals in a video conferencing system.
We believe that there is much more that can be done beyond
the work explored in this paper, for example: exploration of
alternate UI designs, incorporation of richer modalities such
as facial expressions and physiology, and extension to
multi-party conversations.
8. REFERENCES
[1] T. Choudhury, "Sensing and modeling human networks," Ph.D. thesis, MIT, 2004.
[2] A. Pentland, Honest Signals: How They Shape Our World, MIT Press, 2008.
[3] S. Basu, "Conversational Scene Analysis," Ph.D. thesis, MIT, 2002.
[4] R. Picard, Affective Computing, MIT Press, 1997.
[5] P. Ekman and E. L. Rosenberg, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS), Oxford University Press, 2nd edition, 2004.
[6] A. Kapoor, W. Burleson, and R. Picard, "Automatic prediction of frustration," Int'l J. Human-Computer Studies, vol. 65, pp. 724-736, 2007.
[7] J. N. Bailenson et al., "Real-time classification of evoked emotions using facial feature tracking and physiological responses," Int'l J. Human-Computer Studies, vol. 66, pp. 303-317, 2008.
[8] H. C. van Vugt et al., "Effects of Facial Similarity on User Responses to Embodied Agents," ACM Trans. Computer-Human Interaction, vol. 17, 2010.
[9] K. Otsuka et al., "Conversation Scene Analysis with Dynamic Bayesian Network Based on Visual Head Tracking," Proc. ICME, 2006.
[10] D. Jayagopi et al., "Characterizing conversational group dynamics using nonverbal behaviour," Proc. ICME, 2009.
[11] J. M. DiMicco, "Changing small group interaction through visual reflection of social behavior," Ph.D. thesis, MIT, 2005.
[12] J. Sturm et al., "Influencing social dynamics in meetings through a peripheral display," Proc. ICMI, 2007.
[13] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 2001.
[14] R. Gray et al., "Distortion measures for speech processing," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 4, 1980.
[15] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Microsoft Research Tech. Report, 2010.
[16] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognition, vol. 36, no. 1, 2003.
[17] R. Barra et al., "Prosodic and segmental rubrics in emotion identification," Proc. ICASSP, 2006.
[18] A. Kapoor and R. Picard, "A real-time head nod and shake detector," Proc. PUI, 2001.