An Analysis of PCA-based Vocal Entrainment Measures in Married Couples’
Affective Spoken Interactions
Chi-Chun Lee1, Athanasios Katsamanis1, Matthew P. Black1,
Brian R. Baucom2, Panayiotis G. Georgiou1, Shrikanth S. Narayanan1,2
1Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA
2Department of Psychology, University of Southern California, Los Angeles, CA, USA
http://sail.usc.edu
Abstract
Entrainment plays a crucial role in the analysis of married couples' interactions. In this work, we introduce a novel technique for quantifying vocal entrainment based on Principal Component Analysis (PCA). The entrainment measure, as we define it in this work, is the amount of variability of one interlocutor's speaking characteristics that is preserved when projected onto the representing space of the other's speaking characteristics. Our analysis of real couples' interactions shows that when a spouse is rated as having positive emotion, he/she has a higher value of vocal entrainment compared to when rated as having negative emotion. We further performed various statistical analyses on the strength and the directionality of vocal entrainment under different affective interaction conditions to bring quantitative insights into the entrainment phenomenon. These analyses, along with a baseline prediction model, demonstrate the validity and utility of the proposed PCA-based vocal entrainment measure.
Index Terms: vocal entrainment, couples therapy, behavioral
signal processing, principal component analysis
1. Introduction
In a dyadic spontaneous spoken interaction, the interlocutors exert mutual influence on each other's behaviors. This mutual influence guides the dynamic flow of the interaction. It is in this context that the term interaction synchrony, also known as entrainment, is used to describe the naturally occurring coordination between interacting individuals' behaviors in both timing and form. There have been research efforts to quantify specific entrainment behaviors, such as voice activity rhythm [1] and gestures [2], showing that quantitative methods are essential for analyzing interpersonal interaction dynamics in fine detail. Entrainment in conversation describes an important aspect of human interaction dynamics, since it is believed that variations in the pattern of the entrainment phenomenon can offer insights into the behaviors of the interacting individuals; this is especially critical in understanding interaction patterns when the underlying behavior is deemed atypical or distressed. This
has inspired the investigation of new computational approaches, referred to as behavioral signal processing (BSP), to problems in mental health such as couples therapy, addiction, depression, and autism spectrum disorder diagnosis and analysis. The aim of BSP is to automatically analyze abstract human behaviors and states from low-level signal measurements, such as audio and video recordings of interactions. In this work, we attempt to quantify vocal entrainment in the spoken interactions of married couples engaged in affective problem-solving sessions during marital therapy using such signal processing techniques. A major motivation for this quantitative study of vocal entrainment comes from various psychological studies that have stated the importance of the entrainment phenomenon in understanding the nature of couples' interactions [3].
Across a variety of research domains, e.g., econometrics, neuroscience, and physical coupled-system studies, a long list of synchronization measures [4] has been utilized to quantify interdependence between time series and associated variables. These measures often lack straightforward methods to handle complex interaction scenarios like human-human conversations, where the analysis window length (e.g., the length of each speaking turn) per channel (e.g., a speaker in the conversation) varies across time and speakers; the signals associated with human conversations can also be very abstract and complex. The two variables in the time series (corresponding to the interlocutors in the dyad) do not occur simultaneously because of the inherent turn-taking structure of human conversations. These phenomena often violate the underlying assumptions when applying classical synchronization measures to signals of interest. Furthermore, the majority of these measures are symmetric and do not provide information on the direction of synchronization.
In order to improve upon our previous work [5] on quantifying vocal entrainment, we incorporate an expanded list of vocal features and derive a new quantitative vocal entrainment measure based on Principal Component Analysis (PCA). In this work, we propose quantifying vocal entrainment as the amount of variability preserved when representing one speaker's (say, SP1's) vocal characteristics in the vocal characteristics space of the other speaker (say, SP2). The vocal characteristics space is constructed using PCA on acoustic cues. Intuitively, the larger the amount of variability preserved, the higher the vocal entrainment level. This method can address both of the aforementioned concerns because it projects vocal features, of any variable length, onto a transformed vocal characteristics subspace. Furthermore, for a
given speaker pair (say SP1, SP2) in an interaction, this method
can generate two directions of vocal entrainment when we look at any single speaker, say SP1: one corresponds to how much SP1 is entraining toward SP2, and the other corresponds to how much SP1 is getting entrained from SP2.
INTERSPEECH 2011, 28-31 August 2011, Florence, Italy. Copyright © 2011 ISCA.
Various psychology research studies [6, 7] and our own previous work [5] indicate the general existence of a higher level of entrainment when a spouse is rated as having positive affect compared to when rated as having negative affect. While the relationship between the entrainment phenomenon and emotion can be complex [8], we rely on this general trend to investigate the use of the proposed PCA-based vocal entrainment measures. Our analysis of the directionality of entrainment further
indicates that when a spouse is rated as having positive affect, he/she shows statistically significantly more vocal entrainment toward his/her interacting partner, but does not elicit significantly more entrainment from his/her interacting partner. Finally, we use a support vector machine (SVM) to design a baseline prediction model for classifying the session-level code of high positive vs. high negative affect for each spouse using only the vocal entrainment measures.
The paper is organized as follows: we describe the database and research methodology in Section 2. Experimental setup and results are in Section 3, and conclusions are in Section 4.
2. Research Methodology
2.1. Database
The data used in this work were collected as part of the largest longitudinal, randomized controlled trial of psychotherapy for severely and stably distressed couples [9]. The database consists of audio-visual recordings (a single-channel far-field microphone and split-screen video) and observational coding of the behaviors of these real married couples. Multiple trained evaluators were instructed to code the behaviors of each spouse using two standard manual coding systems, the Social Support Interaction Rating System (SSIRS) and the Couples Interaction Rating System (CIRS), resulting in 33 session-level codes for each spouse per interaction. There are a total of 569 sessions (117 unique couples) of couples engaging in problem-solving interactions in which an issue in their relationship was raised and discussed. Since manual transcripts are available, the audio data was automatically segmented into pseudo-turns (with speaker identification: husband, wife, unknown) and aligned to the word transcripts using the SailAlign software [10]. These pseudo-turns are treated as speaking turns in this work because they correspond to the speech of one speaker before the other speaker takes over the floor. Audio quality varies considerably from session to session; therefore, we use only a subset of 372 of the 569 sessions, those meeting the criteria of at least 5 dB SNR and 55% successful speaker segmentation after this automatic process. Details of the database can be found in previous work [11].
The focus of this work is to quantitatively examine the vocal entrainment of married couples in sessions where either spouse was rated with an extreme affective state (positive or negative). The emotional ratings are the codes "Global Positive Affect" and "Global Negative Affect" (based on the SSIRS) for each spouse at the session level. We focus on the sessions, out of these 372, in which a spouse was rated in the top 20% of positive or negative emotion, and denote them as high positive emotion and high negative emotion sessions. This selection of extreme affective states results in a total of 280 sessions with 81 unique couples, of which 140 sessions correspond to high positive emotion and the other 140 to high negative emotion.
2.2. PCA-based Vocal Entrainment Measures
The core idea behind this quantification of vocal entrainment is to construct, per speaking turn, a basis set representing the speaking characteristics space of an interlocutor using PCA. The entrainment level is essentially defined as a measure of similarity when projecting the other interlocutor's speaking characteristics onto this constructed space; in this case, the metric is the amount of variance of one interlocutor's vocal features preserved when projected onto the other interlocutor's space.
Figure 1: Example of Computing Two Directions of Vocal Entrainment for Turn Hi.
A schematic example of how to compute the two directions (toward: veTO, from: veFR) of vocal entrainment for one interlocutor's (the husband's) speaking turn, Hi, in a married couple's interaction is shown in Figure 1. The steps listed below are used to compute the husband's veTO at turn Hi:
1. Extract appropriate vocal features, X1, to represent the husband's speaking characteristics at turn Hi.
2. Perform PCA on the z-normalized X1, such that Y1^T = D1 X1^T.
3. Predefine a variance level (v1 = 0.95) to select the L-subset of basis vectors, D1L.
4. Project the z-normalized vocal features, X2, extracted from the wife's speech at turn Wi, using D1L.
5. Compute the vocal entrainment measure as the ratio of the represented variance of X2 in the D1L basis to the predefined variance level in step 3.
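As a concrete illustration, the five steps above can be sketched with NumPy. This is a minimal sketch under stated assumptions (feature matrices with one word-level vector per row, the 0.95 variance level from step 3, and a small epsilon for numerical safety), not the authors' exact implementation:

```python
import numpy as np

def vocal_entrainment(x1, x2, var_level=0.95):
    """PCA-based entrainment toward speaker 1's turn (sketch).

    x1, x2: (n_words, n_dims) vocal feature matrices for the two turns.
    Returns the ratio of the variance of x2 preserved in speaker 1's
    L-subset basis to the predefined variance level.
    """
    # Steps 1-2: z-normalize each turn and build speaker 1's PCA basis
    z1 = (x1 - x1.mean(0)) / (x1.std(0) + 1e-12)
    z2 = (x2 - x2.mean(0)) / (x2.std(0) + 1e-12)
    eigvals, eigvecs = np.linalg.eigh(np.cov(z1, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: smallest L whose cumulative variance ratio reaches var_level
    ratio = np.cumsum(eigvals) / eigvals.sum()
    L = int(np.searchsorted(ratio, var_level)) + 1
    basis = eigvecs[:, :L]                     # D_1L

    # Steps 4-5: project speaker 2's features, compare preserved variance
    preserved = (z2 @ basis).var(0, ddof=1).sum() / z2.var(0, ddof=1).sum()
    return preserved / var_level
```

Projecting a turn onto its own basis returns a value slightly above 1 (the retained variance ratio divided by 0.95), which matches the intuition that self-similarity is maximal.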
We can compute the other direction of entrainment, veFR, by interchanging X1 with X2. There are two major motivations behind these PCA-based vocal entrainment measures. The first is the elimination of concerns associated with imposing heuristics in the computation of conventional synchronization measures due to the turn-taking structure of human conversation and the variable length of speaking turns (resulting in a different number of vocal feature vectors per speaking turn). These two factors can raise concerns about the reliability of classical measures. With the PCA-based measures, however, because the representation resides in another transformed space, the issues of non-simultaneously occurring time series and variable-length analysis chunks are both lessened. The second is the ability to introduce directionality of entrainment at each speaking turn per speaker. As we can see from Figure 1, there can be two directions of vocal entrainment for a given spouse at each of his/her speaking turns. This directionality can be important for understanding the details of the entrainment phenomenon.
2.3. Representative Vocal Feature Set
The method described in Section 2.2 relies on an appropriate set of acoustic features to represent speaking characteristics. In order to capture the dynamics of the speaking characteristics, PCA is performed on a speaking turn in which the vocal features are computed at the word level. Two different categories of vocal features are used in this work: prosodic features and spectral features. The details of the raw acoustic feature extraction from the audio files, with the necessary preprocessing and speaker normalization, are described in previous work [11]. The following is the final set of acoustic features calculated per word (resulting from automatic alignment) to represent the speaking characteristics:
• Prosodic features (pitch, x4): third-order polynomial fit of the pitch contour per word.
• Prosodic features (energy, x2): mean and variance of the energy per word.
• Prosodic features (word duration, x1): the word duration.
• Spectral features (MFCC, 2x15): mean and variance of the 15-dimensional MFCCs per word.
This list, combined with first-order delta features, generates a 74-dimensional (37x2) vocal feature vector per word. Depending on the length of the speaking turns, this results in variable-length sequences of 74-dimensional vocal feature vectors. PCA is performed on merged-turns (speaking turns merged so that each merged-turn has at least 74 samples) in order to generate a unique set of basis vectors.
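To make the feature layout concrete, the sketch below assembles a hypothetical 37-dimensional word descriptor and appends first-order deltas to reach 74 dimensions. The input structures (frame-level pitch and energy contours, a per-word MFCC matrix) and the choice of computing deltas across consecutive words are assumptions for illustration; the paper does not specify its exact delta computation:

```python
import numpy as np

def word_features(pitch, energy, mfcc, duration):
    """Assemble one 37-dimensional word-level descriptor.

    pitch, energy: frame-level contours for the word (1-D arrays);
    mfcc: (n_frames, 15) matrix; duration: word duration in seconds.
    All inputs are assumed already preprocessed and speaker-normalized.
    """
    t = np.linspace(0.0, 1.0, len(pitch))
    pitch_poly = np.polyfit(t, pitch, 3)                 # 4 pitch coefficients
    energy_stats = [np.mean(energy), np.var(energy)]     # 2 energy features
    mfcc_stats = np.concatenate([mfcc.mean(0), mfcc.var(0)])  # 2 x 15 MFCC stats
    return np.concatenate([pitch_poly, energy_stats, [duration], mfcc_stats])

def with_deltas(word_vecs):
    """Append first-order deltas across consecutive words: 37 -> 74 dims."""
    return np.hstack([word_vecs, np.gradient(word_vecs, axis=0)])
```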
3. Experiment Setups & Results
Three different experiments were set up to analyze different as-
pects of this PCA-based vocal entrainment measure.
• Experiment I: to investigate whether the proposed PCA-based vocal entrainment measures offer a reasonable quantification of vocal entrainment, using two different hypothesis tests.
• Experiment II: to analyze the direction of the PCA-based vocal entrainment under different conditions of affective married couples' interactions.
• Experiment III: to discriminate affective state using the PCA-based vocal entrainment measures as features with a Support Vector Machine.
3.1. Experiment I
We used two different approaches to verify that the PCA-based vocal measure is indeed a viable quantitative measure of vocal entrainment. First, we rely on the general understanding that when couples are engaged in an interpersonal interaction with more positive emotion, the entrainment level is expected to be higher than with negative emotion. Second, we show that the entrainment measure computed between actually interacting couples is statistically higher than the entrainment measure computed between random pairs of turns from couples not engaged in conversation with each other.
3.1.1. Hypothesis Testing Setup & Results
The first hypothesis test verified the PCA-based vocal entrainment approach by using Student's t-test (α = 0.05) to examine whether the value is larger when a spouse was rated with high positive emotion compared with high negative emotion. The distribution of the PCA-based entrainment measures was approximately normal. Table 1 shows the results of the hypothesis test.
Table 1: Entrainment Levels: Higher in Positive Emotion vs. Negative Emotion.
Entrainment Type   High Positive   High Negative   p-value
Toward (veTO)      0.8276          0.8198          0.0103
From (veFR)        0.8307          0.8256          0.0699
Table 1 shows that when a spouse was rated with a positive affective state, the associated PCA-based entrainment measures are higher for both directions (toward and from), though only the toward direction passed the α = 0.05 significance level. This result provides evidence that the PCA-based entrainment measure describes the entrainment phenomenon as it is generally understood in marital communication.
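A one-sided test of this kind can be sketched with SciPy. The per-session values below are synthetic stand-ins (normal draws around the reported means with an assumed 0.01 spread), used only to show the shape of the test, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic per-session ve_TO values: normal draws around the reported
# means (0.8276 / 0.8198) with an assumed spread, 140 sessions each.
pos = 0.8276 + 0.01 * rng.standard_normal(140)
neg = 0.8198 + 0.01 * rng.standard_normal(140)

# One-sided Welch t-test: is entrainment higher under positive affect?
t_stat, p_value = stats.ttest_ind(pos, neg, equal_var=False,
                                  alternative='greater')
```

With these assumed spreads the difference in means is easily detected; the study's actual p-values come from the real session-level distributions.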
Another hypothesis test was conducted using Student's t-test (α = 0.05) to examine whether the PCA-based entrainment computed on sequences of turn-taking for actually interacting couples has a larger value than when computed on random pairs of speaking turns. The intuition is that if this method captures the notion of coherence in dialogs, the measure should have a higher value than when computed on two randomly selected turns (between two people who were not engaged in direct interaction). Instead of examining both directions separately, averages of the values were computed across all 372 sessions (not restricted to only the positive vs. negative sessions). Random entrainment values were computed with 10,000 random draws, with replacement, of pairs of turns from non-interacting couples. Table 2 shows the statistical testing result.
Table 2: Entrainment Levels: Higher in Pairs of Sequential Turn-Taking vs. Random Pairs of Turns.
                      Pairs of Turns   Random Pairs   p-value
Avg. of Entrainment   0.8266           0.8231         0.018
Table 2 provides additional corroborating statistical evidence that this PCA-based method of computing entrainment indeed captures a notion of vocal synchronization, because the value is greater overall when computed across turn sequences of interacting couples than across turn pairs of non-interacting "couples". These two hypothesis testing experiments provide grounding evidence that the signal-derived PCA-based vocal entrainment measure is a viable method for quantifying interpersonal synchronization.
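The random-pair baseline can be sketched as follows. The pooled turn list and the pairwise `measure` callable are hypothetical structures standing in for the per-turn feature matrices and the PCA-based measure of Section 2.2:

```python
import numpy as np

def random_pair_baseline(turns, measure, rng, n_draws=10000):
    """Average entrainment over random turn pairs drawn with replacement.

    turns: pooled list of per-turn feature matrices (hypothetical layout);
    measure: any pairwise entrainment function, e.g. the PCA-based one.
    """
    vals = np.empty(n_draws)
    for k in range(n_draws):
        i, j = rng.integers(0, len(turns), size=2)
        vals[k] = measure(turns[i], turns[j])
    return vals.mean()
```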
3.2. Experiment II
In Experiment II, we extend our statistical analysis to examine the strength of vocal entrainment in each direction (toward and from) under different conditions, termed here the interaction atmosphere. For our problem context, we define three types of interaction atmosphere: (1) both spouses were rated as having high positive emotion; (2) only one spouse was rated with high positive emotion or with high negative emotion; (3) both spouses were rated as having high negative emotion. The following is the list of statistical tests performed with Student's t-test (α = 0.05).
• Test 1: comparison of entrainment measures for type 1 vs. type 3 interactions; the alternative hypothesis states that the entrainment values are higher in type 1.
• Test 2: comparison of entrainment measures within type 2 interactions; the alternative hypothesis states that entrainment values are higher when the spouse was rated as high positive vs. when the spouse was rated as high negative.
• Test 3: comparison of entrainment measures for type 1 vs. type 2 interactions; the alternative hypothesis states that when both spouses were rated as high positive, entrainment values are higher compared to when only one spouse was rated as high positive.
• Test 4: comparison of entrainment measures for type 3 vs. type 2 interactions; the alternative hypothesis states that when both spouses were rated as high negative, entrainment values are lower compared to when only one spouse was rated as high negative.
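A minimal mapping from per-spouse extreme-affect labels to the three atmosphere types might look like this; the 'pos'/'neg' label encoding is a hypothetical convention for illustration:

```python
def atmosphere_type(husband, wife):
    """Map per-spouse extreme-affect labels ('pos'/'neg', a hypothetical
    encoding) to the three interaction-atmosphere types defined above."""
    if husband == 'pos' and wife == 'pos':
        return 1            # type 1: both high positive
    if husband == 'neg' and wife == 'neg':
        return 3            # type 3: both high negative
    return 2                # type 2: only one spouse rated extreme
```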
Table 3: Hypothesis Testing Summary (Various Interaction Atmosphere Types).
Test # (Entrainment Type)   Mean of H0   Mean of Ha   p-value
Test 1 (toward)             0.8196       0.8289       0.0314
Test 1 (from)               0.8196       0.8289       0.0314
Test 2 (toward)             0.8189       0.8265       0.050
Test 2 (from)               0.8311       0.8321       0.3831
Test 3 (toward)             0.8265       0.8289       0.3126
Test 3 (from)               0.8289       0.8321       0.7741
Test 4 (toward)             0.8189       0.8196       0.5635
Test 4 (from)               0.8311       0.8196       0.009
The summary of the statistical testing results of Experiment II is in Table 3. Several notable points can be made from the results in Table 3. First, the vocal entrainment measures are higher (in both directions) when both spouses were rated as having high positive emotion (Test 1), which is expected, as suggested by the psychology literature. Second, with this quantification of vocal entrainment, the results suggest that when a spouse was rated with high positive emotion, he/she shows higher values of entrainment toward his/her interacting partner compared to when he/she was rated as high negative (Test 2). This implies that when a person is in a more positive emotion, his/her vocal characteristics become more similar to those of his/her interacting partner, possibly to ease the tension of the interaction or to provide support. However, the results indicate that his/her interacting partner may not display such entrainment toward him/her. The Test 3 results suggest that there is no difference in the level of vocal entrainment when both spouses were rated with positive emotion compared to when only one spouse was rated with positive emotion. Lastly, the results suggest that when both spouses were rated as high negative, they receive less vocal entrainment from their interacting partner compared to when only one spouse was rated as high negative (Test 4). This outcome is also intuitive, because when both spouses are negative, they are likely less willing to entrain toward one another (less likely to provide emotional support to each other). Through this series of statistical tests, it is encouraging to observe that this method can be a viable approach for detailed analysis of entrainment in relation to psychologists' affective ratings of these distressed couples, with the potential for many more tests of entrainment under various interaction conditions.
3.3. Experiment III
The goal of this experiment is to study the predictive ability of the vocal entrainment measure in recognizing a spouse's session-level affective codes. We performed a baseline binary classification using a Support Vector Machine (with a radial basis function kernel) to differentiate high positive vs. high negative affective states using only the vocal entrainment measure. We focus on only one direction of entrainment (toward) for each spouse, since, as shown in Table 3, it exhibits a statistically significant difference between high positive and high negative affective states. Nine statistical functionals were computed per session (mean, variance, range, maximum, minimum, 25% quantile, 75% quantile, interquartile range, median). Evaluation was done using leave-one-couple-out cross-validation, and we obtained a recognition rate of 51.79%. A more detailed classification setup for recognizing affective state using a multiple instance learning framework further improves the recognition rate to 53.93% with salient vocal entrainment measures [12].
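The classification pipeline described above can be sketched with scikit-learn. The session matrix, labels, and couple groupings below are synthetic placeholders, while the functional list mirrors the nine statistics named in the text:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

def session_functionals(ve):
    """The nine statistical functionals over a session's turn-level ve values."""
    q25, q75 = np.percentile(ve, [25, 75])
    return [np.mean(ve), np.var(ve), np.ptp(ve), np.max(ve), np.min(ve),
            q25, q75, q75 - q25, np.median(ve)]

# Synthetic placeholders: 40 sessions from 10 couples, alternating labels.
rng = np.random.default_rng(3)
X = np.array([session_functionals(0.82 + 0.01 * rng.standard_normal(30))
              for _ in range(40)])
y = np.tile([0, 1], 20)                    # high positive vs. high negative
couples = np.repeat(np.arange(10), 4)      # 4 sessions per couple

# Leave-one-couple-out cross-validation with an RBF-kernel SVM.
scores = cross_val_score(SVC(kernel='rbf'), X, y,
                         groups=couples, cv=LeaveOneGroupOut())
```

Grouping the folds by couple, as in the paper, prevents sessions from the same couple from appearing in both training and test sets.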
4. Conclusions & Future Work
The entrainment phenomenon is an integral aspect of analyzing couples' interactions. Computational measures of vocal entrainment can provide a quantitative characterization to accompany qualitative descriptions of this natural human communication phenomenon. In this work, we propose a PCA-based vocal entrainment measure. It relies on the idea that, to effectively capture this subtle similarity between an interacting dyad, we first construct a space (via PCA) representing the speaking characteristics of each interlocutor with a set of common acoustic features; the entrainment level is then computed as the preserved variability of the other speaker represented in the transformed feature space of the original speaker. The analysis presented in Section 3 shows that this is indeed a viable approach to quantifying vocal entrainment, and various statistical analyses using real couple interaction data have shown differences in the strength and directionality of vocal entrainment when a spouse is rated with high positive compared to high negative affect.
Future work includes investigating better representations of speaking characteristics using various acoustic cues, since a suitable representation of speaking style is a crucial aspect of this PCA-based entrainment method. Another research direction involves utilizing more sophisticated subspace construction methods to overcome inherent problems of PCA, such as its sensitivity to outliers. A further direction is to construct different representations that effectively capture nonverbal behaviors. Since entrainment can provide insights for research on human-human communication, we would like to extend this quantification scheme in the hope of offering psychology experts another useful objective tool for the analysis of married couples' communication.
5. Acknowledgments
This research was supported in part by funds from the National Science Foundation, the Department of Defense, and the National Institutes of Health.
6. References
[1] A. R. McGarva and R. M. Warner, "Attraction and social coordination: Mutual entrainment of vocal activity rhythms," Journal of Psycholinguistic Research, vol. 32, no. 3, pp. 335–354, 2003.
[2] M. J. Richardson, K. L. Marsh, and R. C. Schmidt, "Effects of visual and verbal interaction on unintentional interpersonal coordination," Journal of Experimental Psychology: Human Perception and Performance, vol. 31, no. 1, pp. 62–79, 2005.
[3] K. Eldridge and B. Baucom, "Couples and consequences of the demand-withdraw interaction pattern," in Positive Pathways for Couples and Families: Meeting the Challenges of Relationships. Wiley-Blackwell, in press.
[4] J. Dauwels, F. Vialatte, and A. Cichocki, "Diagnosis of Alzheimer's disease from EEG signals: Where are we standing?" Current Alzheimer Research (invited paper), 2011.
[5] C.-C. Lee, M. P. Black, A. Katsamanis, A. C. Lammert, B. R.
Baucom, A. Christensen, P. G. Georgiou, and S. S. Narayanan,
“Quantification of prosodic entrainment in affective spontaneous
spoken interactions of married couples,” in Proceedings of Inter-
speech, 2010.
[6] M. Kimura and I. Daibo, “Interactional synchrony in conversa-
tions about emotional episodes: A measurement by ’the between-
participants pseudosynchrony experimental paradigm’,” Journal
of Nonverbal Behavior, vol. 30, pp. 115–126, 2006.
[7] L. L. Verhofstadt, A. Buysse, W. Ickes, M. Davis, and I. Devoldre,
“Support provision in marriage: The role of emotional similar-
ity and empathic accuracy,” Emotion, vol. 8, no. 6, pp. 792–802,
2008.
[8] J. M. Gottman, “The roles of conflict engagement, escalation,
and avoidance in marital interaction: A longitudinal view of five
types of couples,” Journal of Consulting and Clinical Psychology,
vol. 61, no. 1, pp. 6–15, 1993.
[9] A. Christensen, D. Atkins, S. Berns, J. Wheeler, D. H. Baucom,
and L. Simpson, “Traditional versus integrative behavioral cou-
ple therapy for significantly and chronically distressed married
couples,” J. of Consulting and Clinical Psychology, vol. 72, pp.
176–191, 2004.
[10] A. Katsamanis, M. P. Black, P. G. Georgiou, L. Goldstein, and
S. S. Narayanan, “SailAlign: Robust long speech-text alignment,”
in Very-Large-Scale Phonetics Workshop, Jan. 2011.
[11] M. P. Black, A. Katsamanis, C.-C. Lee, A. C. Lammert, B. R.
Baucom, A. Christensen, P. G. Georgiou, and S. S. Narayanan,
“Automatic classification of married couples’ behavior using au-
dio features,” in Proceedings of Interspeech, 2010.
[12] C.-C. Lee, A. Katsamanis, M. P. Black, B. R. Baucom, P. G. Georgiou, and S. S. Narayanan, "Affective state recognition in married couples' interactions using PCA-based vocal entrainment measures with multiple instance learning," submitted to ACII, 2011.