Socially-Aware Virtual Agents: Automatically Assessing
Dyadic Rapport from Temporal Patterns of Behavior
Ran Zhao, Tanmay Sinha, Alan W Black, Justine Cassell
Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213 USA
Abstract. This work focuses on data-driven discovery of the temporally co-
occurring and contingent behavioral patterns that signal high and low interper-
sonal rapport. We mined a reciprocal peer tutoring corpus reliably annotated for
nonverbals like eye gaze and smiles, conversational strategies like self-disclosure
and social norm violation, and for rapport (in 30 second thin slices). We then
performed a fine-grained investigation of how the temporal profiles of sequences
of interlocutor behaviors predict increases and decreases of rapport, and how this
rapport management manifests differently in friends and strangers. We validated
the discovered behavioral patterns by predicting rapport against our ground truth
via a forecasting model involving two-step fusion of learned temporal associ-
ated rules. Our framework performs significantly better than a baseline linear
regression method that does not encode temporal information among behavioral
features. Implications for the understanding of human behavior and social agent
design are discussed.
1 Introduction and Motivation
The year is 2025. Zack comes into math class with his personalized virtual peer agent
Zoe projected on his glasses. Zoe smiles as she says to Zack, “You look tired today.
I told you it was a bad idea to play ‘AR Starcraft’ that late on weeknights!” Zack
grimaces, “OK, so I’m tired. But it was awesome! The whole math class was getting to
know one another - that’s work, right?” to which Zoe nods and responds by indexing
their shared experience: “Perhaps, but last time you did this, I was too exhausted the
next day to help you.”
Zack and Zoe then work on the math task they are supposed to complete. Zoe starts
off - “We need to solve this set of linear equations 5x*(3x-18)=10 first”. Zack seems
a bit confused “Well, I’m familiar with fractions, but I suck at linear equations.” Zoe
gazes at the work sheet, then back at Zack and finally provides motivational scaffolding
in the form of negative self disclosure followed by praise, in order to boost their inter-
personal bond, and Zack’s confidence “Don’t worry, I used to suck at linear equations
too, but you’re a rockstar at this stuff. You’ll be fine. Besides which, we’ll go through it
together,” following with a smile.
This vision illustrates several factors related to the important role that the relation-
ship between learners, or learners and their tutors, can play in improving learning gains.
While this phenomenon is described in the educational literature [20, 23], there have
existed no rigorous models of the mechanism underlying the relationship between social
and cognitive functioning in tasks such as these [24], nor do there exist computational
models of interpersonal closeness that can drive the functioning of an intelligent tutor.
Zhao, R., Sinha, T., Black, A., Cassell, J. (2016) “Socially-Aware Virtual Agents: Automatically Assessing Dyadic Rapport from Temporal Patterns of Behavior.” Proceedings of Intelligent Virtual Agents (IVA 2016).
There is therefore great opportunity to expand on the social capabilities of current ed-
ucational technologies in order to create long-term interpersonal connectedness in the
service of increased adaptivity in learning [30], and thereby increased learning gains.
In this vein, here we investigate the dynamics of social interaction in longitudinal peer
tutoring, as manifested in verbal and nonverbal behaviors. The aspect of
social interaction that we focus on is rapport management, as rapport is argued to be
one of the central constructs necessary to understanding successful helping
relationships [4], and rapport management is abundantly present in peer tutoring [32].
Let us at this point step back to describe what we mean by rapport. Rapport is often
defined as “a close and harmonious relationship in which the people concerned appear
to understand each other’s feelings or ideas and communicate well”; however, we feel it
is best described by examples and so, below, are two examples from our corpus, of high
and low rapport, respectively.
High Rapport:
P1: I suck at negative numbers;
P2: it’s okay so do I;
P1: {smile}
P2: uh actually no I don’t, negative numbers are easy

Low Rapport:
P2: [silent] [long pause]
P1: shh; [long pause]
P2: alright;
P1: let me do my work;
In our own prior work, we proposed a computational model of long-term interper-
sonal rapport to explain how humans in dyadic interactions build, maintain and destroy
rapport through the use of specific conversational strategies [38]. Because these strate-
gies function to fulfill specific social goals and are instantiated in particular verbal and
nonverbal behaviors, studying the synergistic interaction of conversational strategies
and nonverbal behaviors on rapport management is important. To do so, not only is a
qualitative examination of the dyadic behavior patterns that benefit or hurt interpersonal
rapport essential, but it is also desirable to build automated frameworks that learn
fine-grained behavioral interaction patterns indexing such social phenomena. The latter
has received less attention, in part due to the time-intensive nature of collecting and
annotating behavioral data for different aspects of interpersonal connectedness, and the
difficulty of developing and using machine learning algorithms that can take the time
course of interaction among different modalities and between interlocutors into account.
Learning fine-grained behavioral interaction patterns that index rapport is the focus of
the current work. There are three key issues that we believe should be taken into con-
sideration when performing such assessment.
(1) When the foundational work by [35] described the nature of rapport, three
interrelating components were posited: positivity, mutual attentiveness and coordination.
Their work demonstrated that, over the course of a relationship, positivity decreases
and coordination increases. Factors such as these, then, depend on the stage of the
relationship between interlocutors [38], and therefore it is necessary to take into account
the relationship status of a dyad when extracting dyadic patterns of rapport. (2) While our
previous work [28] discovered some of the common behaviors exhibited by dyads in
peer tutoring to build or maintain rapport (playful teasing, face-threatening comments,
attention-getting, etc.), tutors and tutees were looked at separately, and each of these
behaviors was examined in isolation from the others. In the current work, our interest
is in moving beyond individual behaviors to focus on temporal sequences of such
behaviors in the dyadic context. Likewise, our prior work did not distinguish between
rapport management during task (tutoring) vs social activities. We believe that the in-
teractions between verbal and nonverbal behaviors may manifest differently in social
and tutoring periods, since the roles of a tutor and tutee are more evident in the tutoring
compared to the social periods. (3) Most prior computational work examining rapport,
such as [12, 13, 18], has used post-session questionnaires to assess rapport. However, to
measure the effect of multimodal behavioral patterns on rapport and better reason about
the dynamics of social interaction, a finer-grained ground truth for rapport is needed.
In this paper, then, we take a step towards addressing the above limitations. To create
a longitudinal peer tutoring corpus, we compared friend to stranger dyads, bringing each
dyad back for five face-to-face sessions over five weeks. In each session, two tutoring
periods were interspersed with three social periods. The students switched roles so that
each both tutored and was tutored. We employed thin-slice coding [2] to elicit ground
truth for rapport, by asking naive raters to judge rapport for every 30 second slice of
the hour-long peer tutoring session, presented to raters in a randomized order. This, in
turn, allowed us to analyze fine-grained sequences of verbal and nonverbal behaviors
that were associated with high or low rapport between the tutor and tutee.
As a side note, while the current paper addresses these phenomena in the context
of peer tutors and intelligent tutoring agents, this work is part of a larger research pro-
gram that targets more general models of how to predict rapport between interlocutors
in real time, using as input the interaction among linguistic (verbal) and nonverbal (vi-
sual) behaviors. This basic science serves as input in some of our work into embodied
conversational agents that can use the dyad’s current rapport as part of a decision about
what to say next to manage rapport with the user as, in turn, input into a decision about
how best to help the user achieve his/her goals, goals that include, in some of our agents,
peer tutoring.
2 Related Work
2.1 Individual-focused Temporal Relations
The study of temporal relationships between verbal and nonverbal behaviors has been
of prime importance in understanding various social and cognitive phenomena. A lot of
this work has focused on the observable phenomena of interaction (low level linguis-
tic, prosodic or acoustic behaviors that can be automatically extracted) or has leveraged
computational advances to extract head nods, gaze, facial action units, etc., as a step
towards modeling co-occurring and contingent patterns inherent in an individual per-
son’s behavior. Since feature extraction approaches that aggregate information across
time are not able to explicitly model temporal co-occurrence patterns, two popular tech-
nical approaches to investigate temporal patterns of verbal and nonverbal behaviors are
histogram of co-occurrences [29] and motif discovery methods [27].
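As a minimal illustration of the first of these approaches, the sketch below counts, for each pair of binary behavior channels, the number of discretized time windows in which both are active (the channel names and data are invented, not drawn from any corpus discussed here):

```python
from itertools import combinations

def cooccurrence_histogram(channels, num_windows):
    """Count, for every pair of behavior channels, the number of
    time windows in which both channels are active.

    channels: dict mapping channel name -> set of active window indices.
    """
    hist = {}
    for (a, sa), (b, sb) in combinations(sorted(channels.items()), 2):
        hist[(a, b)] = sum(1 for w in range(num_windows) if w in sa and w in sb)
    return hist

# Toy example: three behavior channels over 6 windows.
channels = {
    "smile":   {0, 1, 4, 5},
    "gaze":    {1, 2, 4},
    "overlap": {3, 4},
}
hist = cooccurrence_histogram(channels, 6)
# "smile" and "gaze" co-occur in windows 1 and 4, so hist[("gaze", "smile")] == 2.
```

Such a histogram captures co-occurrence but, as noted above, not the contingent (ordered, lagged) structure that temporal association rules can express.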
For instance, [21] presented a study of co-occurrence patterns of human nonverbal
behaviors during intimate self-disclosure. However, contingent relations between
different nonverbal behaviors were not considered, which could extensively contribute to
the design of a social agent that interacts with a human over time. [36] learned behavioral
indicators that were correlated with expert judges’ opinions of each key performance
aspect of public speaking. They fused the modalities by utilizing a least-squares boosted
regression ensemble tree and predicted speaker performance. However, this work also
did not consider the effect of interactions among different modalities and their tem-
poral relations. In similar vein, [6] introduced deep conditional neural fields to model
the generation of gestures by integrating verbal and acoustic modalities, while using an
undirected second-order linear chain to preserve temporal relations between gestures
as well. However, this approach only modeled individual co-verbal gestures, without
considering interaction between the speaker and the interlocutor.
In [17] temporal combinations of individual facial signals (such as nod, smiles etc.)
were used to infer positive (agree, accept etc.) and negative (dislike, disbelief etc.)
meanings via ratings by humans. An interesting take-away from this work was that
a combination of signals could significantly alter the perceived meaning. For instance,
facial tension alone and frown alone did not mean “dislike”, but the combination of frown
and tension did. Tilt alone and gaze right-down alone did not mean “not interested” as
significantly as the combination of tilt and gaze. However, while a combination of these
nonverbals signaled higher level constructs (that were in turn associated with some
pragmatic meaning), the authors were more interested in how these combinations were
perceived by humans, rather than necessarily in a predictive task or testing these com-
binations in a human-agent dialog.
2.2 Dyadic Temporal Relations
In a conversation, attending to the contributions of both interactants adds greater
complexity in reasoning about the social aspects of the interaction. Listeners show their
interest, attention and understanding in many ways during the speaker’s utterances. Such
“listener responses” [10], which may be manifested through gaze direction and eye
contact, facial expressions, use of short utterances like “yeah”, “okay”, and “hm-m”,
or even intonation, voice quality and the content of the words, are carriers of subtle
information. These cues may convey information regarding understanding (whether the
listeners understand the utterance of the speaker), attentiveness (whether the listeners
are attentive to the speech of the speaker), coordination, and so forth.
For instance, [14] looked at observable lexical, acoustic and prosodic cues produced
by the speaker followed by back channeling from the listener. The authors found that
the likelihood of occurrence of a backchannel from the interlocutor appeared to in-
crease with simultaneous occurrence of one or more cues by the speaker, such as final
rising intonation, higher intensity and pitch levels, longer inter-pausal units (maximal
sequence of words surrounded by silence longer than 50 ms) etc. However, in this work,
no attempt was made to use the temporal sequence or co-occurrence of observables pre-
ceding a backchannel to predict higher level social constructs such as positivity, coor-
dination, attentiveness, or underlying psychological states such as rapport or trust.
[1] explored the interplay between head movements, facial movements like smiling
and eyebrow raising, and verbal feedback in a range of conversational situations,
including continued attentiveness, understanding, agreement, surprise, disappointment,
acknowledgment and refusing information. As the situations became more negative
(disappointment, refusing information), the accompanying nonverbals became more ex-
tensive in time - no longer just a head nod, but a series of movements. The authors claim
that this series of movements functioned to add some extra information or to emphasize
or contradict what had been said, but ground truth was not provided for these claims.
Finally in [7], the authors used sequence mining methods to automatically extract
nonverbal behavior sequences of the recruiters that were representative of interpersonal
attitudes. Then, Bayesian networks were deployed to build a generation model for com-
puting a set of nonverbal sequence candidates, which were further ranked based on the
previously extracted frequent sequences. Even though this work considered the effect of
sequencing of nonverbal signals, their model could be improved by the addition of tem-
poral information inside these sequences, the addition of verbal signals and modeling
of listeners’ behaviors as well.
3 Study Context
3.1 Data
Reciprocal peer tutoring data was collected from 12 American English-speaking dyads
(6 dyads were friends and 6 strangers; 6 were boys and 6 girls), with a mean age of
13 years (range 12 to 15), who interacted for 5 hourly sessions over as many
weeks (a total of 60 sessions, and 5400 minutes of data), tutoring one another on proce-
dural and conceptual aspects of linear equations [37]. All interactions were videotaped
from three camera views (a frontal view of each participant and a side view of the two
participants). Speech was recorded by lapel microphones in separate audio channels.
Each session began with a period of getting to know one another, after which the first
tutoring period started, followed by another small social interlude, a second tutoring pe-
riod with role reversal between the tutor and tutee, and then the final social time. Prior
work demonstrates that peer tutoring is an effective paradigm that results in student
learning [31], making this an effective context to study dyadic interaction with a con-
crete task outcome. Our student-student data demonstrates that a tremendous amount of
rapport-building takes place during the task of reciprocal tutoring [33]. In their recent
review of the research on design spaces for computer supported reciprocal tutoring,
[8] emphasize that reciprocal tutoring is a natural extension of one-on-one tutoring in
today’s networked world.
3.2 Annotations
We assessed rapport-building via thin slice annotation [2], or rapidly made judgments
of interpersonal connectedness in the dyad, based on brief exposure to their verbal and
nonverbal behavior. Naive raters were provided with a simple definition of rapport and
three raters annotated every 30 second video segment of the peer tutoring sessions for
rapport using a 7-point Likert scale. Weighted majority rule was deployed to mitigate
bias from the ratings of different annotators, account for label over-use and under-use
and pick a single rapport rating for each 30 second video segment. The segments were
presented to the annotators in random order so as to ensure that raters were not actually
annotating the delta of rapport over the course of the session. Prior work has shown
that such reliably annotated measures of interpersonal rapport are causally linked to
behavioral convergence of low-level linguistic features (such as speech rate) of
the dyad [32, 33], and that greater likelihood of being in high rapport in the next 30
sec segment (improvement in rapport dynamics over the course of the interaction) is
positively predictive of the dyad’s problem-solving performance.
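As an illustration of how a single rating per slice might be picked, the sketch below implements a plain weighted vote; the annotator names and reliability weights are hypothetical, and the scheme actually used also corrects for label over- and under-use:

```python
from collections import defaultdict

def weighted_majority(ratings, weights):
    """Pick a single rapport label for one 30-second slice.

    ratings: dict annotator -> label (1..7)
    weights: dict annotator -> reliability weight (hypothetical values;
             the paper's weights additionally account for label
             over-use and under-use by each annotator)
    """
    votes = defaultdict(float)
    for annotator, label in ratings.items():
        votes[label] += weights[annotator]
    # Highest weighted vote wins; ties broken by the lower label.
    return min(votes, key=lambda lab: (-votes[lab], lab))

# Two raters say 5, one says 3; the weighted vote for 5 (1.6) wins over 3 (1.0).
label = weighted_majority({"r1": 5, "r2": 5, "r3": 3},
                          {"r1": 0.9, "r2": 0.7, "r3": 1.0})
```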
In addition, we also annotated the entire corpus for conversational strategies such
as self-disclosure (Krippendorff’s α = 0.753), reference to shared experience (α = 0.798),
praise (α = 1), social norm violation (α = 0.753) and backchannel (α = 0.72) in the first
pass, and reciprocity in these strategies (using a time window of roughly 1 minute)
in the second pass (α = 0.77). [34] has investigated the phenomenon of congruence or
interpersonal synchrony in usage of such conversational strategies, in absolute num-
ber as well as the pattern of timings, and found positive relationships with rapport and
problem-solving performance. In other work, we have also shown that these conversa-
tional strategies can be reliably detected from observable indicators of verbal, visual
and acoustic cues with an accuracy of over 80% and kappa ranging from 60–80% [39].
Finally, our temporal association rule framework comprised nonverbal behaviors like
eye gaze (Krippendorff’s α = 0.893) and smiles (α = 0.746), which we have found to
significantly co-occur with conversational strategies [39].
4 Method
The technical framework we employ in this work is essentially an approach for pattern
recognition in multivariate symbolic time sequences, called the Temporal Interval Tree
Association Rule Learning (Titarl) algorithm [15]. Since it is practically infeasible to
predict exactly when certain behavioral events happen, it is suitable to use probabilistic
approaches that can extract patterns with some degree of uncertainty in the temporal re-
lation among different events. Temporal association rules, where each rule is composed
of certain behavioral pre-conditions (input events) and behavioral post-conditions (out-
put events), are one such powerful approach. In our case, input events are conversational
strategies and nonverbal behaviors such as social norm violations and smiles. The output
event is the absolute value of thin-slice rapport. Because interpersonal rapport is a
social construct that is defined at the dyadic level, the applied framework helps reveal
interleaved behavioral patterns from both interlocutors. An example of a simple generic
temporal rule is given below. It illustrates the rule’s flexibility by succinctly describing
not only the temporal uncertainty in the location of the output event,
but also the probability of the rule being fired:
“If event A happens at time t, there is a 50% chance of event B happening between time
t+3 and t+5.”
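Such a rule can be represented directly in code. The sketch below is a minimal, hypothetical representation (the class and field names are ours, not Titarl’s): a rule stores its trigger event, output event, window offsets, and confidence, and reports the prediction windows it opens when matched against a symbolic event sequence:

```python
class TemporalRule:
    """'If `trigger` happens at time t, then with probability `confidence`
    the `output` event happens in the interval [t + lo, t + hi].'"""

    def __init__(self, trigger, output, lo, hi, confidence):
        self.trigger, self.output = trigger, output
        self.lo, self.hi, self.confidence = lo, hi, confidence

    def matches(self, events):
        """Return (trigger_time, prediction_window) for every activation.

        events: list of (time, event_name) pairs, e.g. [(0, "A"), (2, "C")].
        """
        return [(t, (t + self.lo, t + self.hi))
                for t, name in events if name == self.trigger]

# The example rule from the text: A at t => 50% chance of B in [t+3, t+5].
rule = TemporalRule("A", "B", lo=3, hi=5, confidence=0.5)
activations = rule.matches([(0, "A"), (2, "C"), (7, "A")])
# → [(0, (3, 5)), (7, (10, 12))]
```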
Intuitively, the Titarl algorithm is used to extract a large number of temporal association
rules (r) that predict future occurrences of specific events of interest. The dataset comprises
both multivariate symbolic time sequences E_{i=1,...,n} and multivariate scalar time
series S_{i=1,...,m}, where E_i = {t_j^i ∈ R} is the set of times at which event e_i happens, and
S_i is an injective mapping from every time point to a scalar value. Before the learning
process, a parameter w, the window size, is specified, which allows us at each time point
t to compute the probability for the target event to exist in the time interval [t, t+w].
The four main steps in the Titarl algorithm [15] are: (i) exhaustive creation of simple
unit rules that are above the threshold value of confidence or support, (ii) addition of
more input channels in order to maximize information gain, (iii) production of more
temporally precise rules by decreasing the standard deviation of the rule’s probability
distribution, (iv) refinement of the condition and conclusion of the rules by application
of a Gaussian filter on the temporal distribution. Confidence, support and precision of a
rule are three characteristics that validate its interest and generalizability. For a simple
unit rule e1 → e2 (confidence: x%, support: y%), confidence refers to the probability of
a prediction of the rule being true, support refers to the percentage of events explained
by the rule, and precision is an estimation of the temporal accuracy of the predictions:

support_r = #{e2 | r is active} / #{e2},    precision_r = 1 / (standard deviation of r’s temporal distribution)
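The confidence and support just defined can be estimated empirically from event times. Below is a dependency-free sketch, under the simplifying assumptions that the rule has a single condition and a hard window [t+lo, t+hi]; the function name and toy data are illustrative, not Titarl’s implementation:

```python
def rule_stats(trigger_times, target_times, lo, hi):
    """Empirical confidence and support of the rule
    'trigger at t => target in [t + lo, t + hi]'.

    confidence: fraction of activations whose window contains a target.
    support: fraction of target events explained by some activation.
    """
    hits = sum(any(t + lo <= s <= t + hi for s in target_times)
               for t in trigger_times)
    confidence = hits / len(trigger_times) if trigger_times else 0.0
    explained = sum(any(t + lo <= s <= t + hi for t in trigger_times)
                    for s in target_times)
    support = explained / len(target_times) if target_times else 0.0
    return confidence, support

# Triggers at 0, 10, 20; targets at 4 and 30; window [t+3, t+5].
conf, supp = rule_stats([0, 10, 20], [4, 30], lo=3, hi=5)
# Only the activation at t=0 hits a target (4), and only target 4 is
# explained: conf = 1/3, supp = 1/2.
```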
5 Experimental Results
We first separated out friend and stranger dyads to learn rules from their behaviors
separately. We also tagged the data as occurring during a social or tutoring period, and as
being generated by a tutor or a tutee. We then randomly divided the friend and stranger
groups into a training set (4 dyads) and test set (2 dyads). In the first experiment, we
extracted a potentially large number of temporal association rules affiliated with each
individual rapport state (from 1 to 7). In this experiment, for each event, we looked back
60 seconds to find behavioral patterns associated with it. A representative example is
shown in Figure 1. Some of the rules in the test set whose confidence is above 50%, and
which apply to more than 20 cases, are described below, divided into friends (F) and
strangers (S), and into high rapport (H), defined as thin-slice rapport states 5, 6, and 7,
and low rapport (L), defined as states 1, 2, and 3.
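The filtering step just described (confidence above 50%, applied more than 20 times) amounts to a simple threshold pass over the extracted rules; in the sketch below the dictionary keys and toy values are illustrative:

```python
def select_rules(rules, min_confidence=0.5, min_applications=20):
    """Keep rules whose confidence exceeds 50% and that apply to more
    than 20 cases (the thresholds used in the text)."""
    return [r for r in rules
            if r["confidence"] > min_confidence
            and r["applications"] > min_applications]

rules = [
    {"id": "FH1", "confidence": 0.72, "applications": 41},
    {"id": "FL9", "confidence": 0.55, "applications": 12},  # too rare
    {"id": "SH2", "confidence": 0.44, "applications": 90},  # low confidence
]
kept = select_rules(rules)  # only "FH1" survives both thresholds
```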
5.1 Behavioral Rules for Friends
There are 14,458 total rules for friends with confidence higher than 50%, 14,345 of
which apply to friends in high rapport states. Overall, engaging in reference to shared
experience, smiling while violating a social norm and overlapping speech are associated
with high rapport. Examples are:
FH 1 One of the students smiles while the other violates a social norm (Social period)
FH 2 One of the students refers to shared experience (Social period)
FH 3 One student smiles and violates a social norm, and the second smiles and gazes at
the partner within the next minute (Social period)
FH 4 The two conversational partners overlap speech while one is smiling, following
which the second starts smiling within the next minute (Social period)
FH 5 The tutee reciprocates a social norm violation while overlapping speech with the
tutor, following which the tutor smiles and violates a social norm (Task period)
[shown in Figure 1]
In contrast to the high number of rules with confidence higher than 50% for friends
in high rapport, there are only 113 rules that satisfy these criteria for friends in low
rapport. Some examples are:
FL 1 The tutor finishes violating a social norm while gazing at the tutee’s work sheet,
and within the next minute the tutee follows up with a social norm violation, but
gazing at his/her own work sheet (Task period)
FL 2 The tutor reciprocates a social norm violation without a smile and neither the tutee
nor the tutor gaze at one another. Meanwhile, the tutee begins violating another
social norm within the next minute (Task period)
FL 3 The tutor backchannels while gazing at his/her own work sheet and does not smile.
Moreover, the tutor also overlaps with the tutee in the next minute (Task period)
5.2 Behavioral Rules for Strangers
There are 761 total rules for strangers, of which 130 are rules that apply to strangers in
high rapport. In general, smiling and overlapping speech while using particular conver-
sational strategies are associated with high rapport. Some examples are:
SH 1 One of the interlocutors smiles while the other gazes at him/her and begins self-
disclosing, and they overlap speech within the next minute (Social period)
SH 2 One of the interlocutors smiles and backchannels in the next minute (Social period)
SH 3 The interlocutors’ speech overlaps and the tutee smiles within the next minute (Task
period)
631 rules, then, explain strangers in low rapport. Interestingly, rules that explain low
rapport among strangers most often come from task periods. In general, overlapping
speech after a social norm violation leads to low rapport in strangers. Some examples
are:
SL 1 The tutor smiles and gazes at the worksheet of the tutee while the tutee does not
smile (Task period)
SL 2 The tutor violates social norms while being gazed at by the tutee, and their speech
overlaps within the next minute (Task period)
SL 3 The tutor smiles and the tutee violates a social norm within the next 30 seconds,
before their speech overlaps within the next 30 seconds (Task period) [shown in
Figure 2]
6 Validation and Discussion
In order to demonstrate that the extracted temporal association rules can be reliably
used for forecasting of interpersonal human behavior, we first applied machine learn-
ing to perform an empirical validation, which we describe in the next subsection. The
motivation behind constructing this forecasting model was to show that the automatically
learned temporal association rules are good indicators of the dyadic rapport state. In
the subsequent subsections of the discussion, we will discuss implications of our work
for the understanding of human behavior and the design of “socially-skilled” agents,
linking prior strands of research.
6.1 Estimation of Interpersonal Rapport
In addition to its applicability to sparse data, one of the prime benefits of the tempo-
ral association rule framework to predict a high-level construct such as rapport lies in
its flexibility in modeling presence/absence of human behaviors and also the inherent
uncertainty of such behaviors, via a probability distribution representation in time. In
summary, the estimation of rapport comprises two steps: in the first step, the intuition is
Fig. 1: Friends in high rapport - The tutee reciprocates a social norm violation while over-
lapping speech with the tutor, following which the tutor smiles while the tutee violates a
social norm.
An example from the corpus is shown below:
Tutor: Sweeney you can’t do that, that’s the whole point{smile};[Violation of Social Norm]
Tutee: I hate you. I’ll probably never never do that; [Reciprocate Social Norm Violation]
Tutor: Sweeney that’s why I’m tutoring you{smile};
Tutee: You’re so oh my gosh {smile}. We never did that ever; [Violation of Social Norm]
Tutor: {smile}What’d you say?
Tutee: Said to skip it{smile};
Tutor: I can just teach you how to do it;
Fig. 2: Strangers in low rapport - The tutor smiles and the tutee violates a social norm within
the next 30 seconds, before their speech overlaps within the next 30 seconds.
An example from the corpus is shown below:
Tutee: divide oh this is so hard let me guess;eleven;
Tutor: you know;
Tutee: six;
Tutor: next problem is is exactly the same {smile}, over eleven equals, eleven x over eleven;
Tutee: I don’t need your help; [Violation of Social Norm]
Tutor: {Overlap}That is seriously like exactly the same.
to learn the weighted contribution (vote) of each temporal association rule in predicting
the presence/absence of a certain rapport state (via seven random-forest classifiers); in
the second step, the intuition is to learn the weight of each binary classifier for each
rapport state, to predict the absolute continuous value of rapport (via linear regression).
For clarity, we will use the following three mathematical subscripts to represent different
types of index: i, the index of output events; k, the index of time-stamps; j, the index of
temporal association rules.

Each individual rapport state is treated as a discrete output event e_i, where i =
1, 2, 3, 4, 5, 6, 7. We learn the set of temporal association rules R_i = {r_j^i} for each
output event e_i. In the first step, a matrix M_i is constructed with |T_i| rows and 1 + |R_i|
columns, where T_i = {t_k^i ∈ R} denotes the set of time-stamps at which at least one of
the rules in set R_i is activated. M_i(k, j) ∈ [0, 1] denotes the confidence of the rule r_j^i at
the particular time point t_k^i. The extra column represents the indicator function of the
rapport state: M_i(k, |R_i| + 1) = 1 if t_k^i ∈ E_i, and 0 otherwise. Seven random-forest
classifiers (f_i(t), t ∈ T_i) are then trained on each corresponding matrix M_i, using the
last column (binary) as the output label and all other columns as input features [16]. In
the second step, another matrix G with |T| rows and 1 + |C| columns is formalized, where
|C| is the number of random-forest classifiers, G(k, i) = f_i(t_k) and T = {t_k | t_k ∈ T_i,
i = 1...7}. The last column is the absolute value of the rapport state gathered from the
ground truth. This matrix is used to train a linear regression model.

Relationship Status | t-test value       | Mean value (Mean Square Error)            | Effect Size
Friends             | t(1,14) = -6.41*** | Titarl = 1.257, Linear Regression = 2.120 | -0.42
Strangers           | t(1,14) = -8.78*** | Titarl = 0.837, Linear Regression = 1.653 | -0.62

Table 1: Statistical analysis comparing the mean square error of Titarl-based regression and a
simple linear regression, for all possible combinations of training and test sets in the corpus.
Effect size assessed via Cohen’s d. Significance: ***: p<0.001, **: p<0.01, *: p<0.05
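The two-step fusion can be sketched with stand-in models: a simple average in place of each per-state random forest, a single aggregate feature in place of the seven classifier outputs, and closed-form one-feature least squares in place of the fitted linear regression. Everything below is an illustrative toy, not the implementation used in the paper:

```python
def step1_score(rule_confidences):
    """Stand-in for a per-state random-forest classifier: map the
    confidences of the rules active at a time point to one score."""
    if not rule_confidences:
        return 0.0
    return sum(rule_confidences) / len(rule_confidences)

def fit_step2(scores, labels):
    """Stand-in for the second-step linear regression: ordinary least
    squares y ≈ a*x + b on a single aggregate feature (closed form)."""
    n = len(scores)
    mx, my = sum(scores) / n, sum(labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(scores, labels))
    var = sum((x - mx) ** 2 for x in scores)
    a = cov / var
    return a, my - a * mx

# Step 1: per-slice aggregate scores from active-rule confidences.
scores = [step1_score(c) for c in ([0.9, 0.7], [0.2], [0.6, 0.6], [0.1, 0.3])]
# Step 2: fit against the ground-truth rapport labels of those slices.
a, b = fit_step2(scores, [6, 2, 5, 2])

def predict(rule_confidences):
    """Continuous rapport estimate for a new slice."""
    return a * step1_score(rule_confidences) + b
```

Slices whose active rules carry higher confidence receive higher predicted rapport, mirroring the direction of the fusion described above.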
For our corpus, as part of the Titarl-based regression approach, we first extracted
the top 6000 rules for friend dyads and 6000 rules for stranger dyads from the training
dataset, with the following parameter settings: minimum support: 5%, minimum confidence:
5%, maximum number of conditions: 5, minimum use: 10. Second, we fused
those rules based on the algorithm discussed above and applied them to the test set,
performing a 10-fold cross validation. In order to test the robustness of the results, we repeated
the experiment for all possible random combinations of training (4 dyads) and test (2
dyads) sets for friends and strangers, and performed a correlated samples t-test to test
whether our approach results in lower mean squared error compared to a simple linear
regression model that treats each of the verbal and nonverbal modalities as independent
features to predict the absolute value of rapport. Evaluation for performance metrics in
this basic linear regression approach was done using the supplied test set of randomly
chosen 2 dyads for each experimental run. In addition, we also calculated the effect size via Cohen's d (d = 2t/√df), where t is the value from the t-test and df refers to the degrees of freedom. Results in Table 1 suggest that the Titarl-based regression method has a significantly lower mean square error than the naive baseline linear regression method. The large effect sizes for both strangers (d = -0.62) and friends (d = -0.42) further demonstrate the substantial improvement in the accuracy of assessing rapport with Titarl-based regression compared to simple linear regression.
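The statistical comparison above can be reproduced on any pair of per-split error lists. The MSE values below are made-up illustrations, not the paper's results; only the test and effect-size formulas follow the text.

```python
import math
from statistics import mean, stdev

# Hypothetical per-split mean squared errors for the two models, one value
# per random train/test combination (illustrative numbers, not the paper's).
mse_titarl = [1.31, 1.18, 1.25, 1.22, 1.30, 1.27, 1.19, 1.28,
              1.24, 1.26, 1.21, 1.29, 1.23, 1.25, 1.27]
mse_linear = [2.10, 2.05, 2.18, 2.09, 2.15, 2.12, 2.07, 2.20,
              2.11, 2.14, 2.08, 2.16, 2.13, 2.10, 2.17]

# Correlated (paired) samples t-test: both models are scored on the same
# splits, so we test whether the mean pairwise difference is zero.
diffs = [a - b for a, b in zip(mse_titarl, mse_linear)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Effect size via Cohen's d = 2t / sqrt(df), with df = n - 1 pairs.
df = n - 1
cohens_d = 2 * t_stat / math.sqrt(df)

print(f"t({df}) = {t_stat:.2f}, d = {cohens_d:.2f}")
```

A negative t and d here mean the first model's error is lower, matching the direction of the signs reported in Table 1.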
These results have been integrated into a real-time end-to-end socially aware dialog system (SARA), described in [26]. SARA is capable of automatically detecting
conversational strategies based on verbal, nonverbal, and acoustic features in the user’s
input [39], relying on the conversational strategies detected in order to accurately esti-
mate rapport between the interlocutors, reasoning about what conversational strategy to
respond with as the next turn, and generating those appropriate responses in the service
of more effectively carrying out her task duties. To our knowledge, SARA is the first
socially-aware dialog system that relies on visual, verbal, and vocal cues to detect user
social and task intent, and generates behaviors in those same channels to achieve her
social and task goals.
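The sense-reason-respond loop just described might be sketched as follows. Every name, heuristic, and the keyword-based detector here is a placeholder invented for this sketch, not SARA's actual components or API.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    strategy: str  # e.g. "self_disclosure", "norm_violation", "none"

def detect_strategy(text: str) -> str:
    """Toy stand-in for the multimodal conversational-strategy classifier [39]."""
    if "i think" in text.lower() or "my" in text.lower():
        return "self_disclosure"
    return "none"

def estimate_rapport(history: list) -> float:
    """Toy stand-in for the temporal-rule rapport estimator (scale 1-7)."""
    score = 4.0
    for turn in history:
        if turn.strategy == "self_disclosure":
            score += 0.5  # disclosures nudge estimated rapport upward
    return min(score, 7.0)

def choose_strategy(rapport: float) -> str:
    """Reciprocate disclosure when rapport is high; stay neutral otherwise."""
    return "self_disclosure" if rapport >= 5.0 else "none"

# One pass through the loop: detect -> estimate -> choose next strategy.
history = [Turn("My weekend was great, I went hiking.", "self_disclosure"),
           Turn("I think I need help with this task.", "self_disclosure")]
rapport = estimate_rapport(history)
print(choose_strategy(rapport))
```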
6.2 Implications for Understanding Human Behavior
One of the important behavior patterns that plays out differently across friends and
strangers, and whose interactions can lead to either high or low rapport, is smiling
in combination with social norm violations and speech overlap. A violation of social
norms without a smile is always followed by low rapport. A social norm violation accompanied by a smile, on the other hand, is followed by high rapport when overlap ensues among friends, but by low rapport when overlap ensues among strangers [See FH1, FH3, FH5, FL1, FL2, SL3]. What we may be seeing here is what [11] described as embarrassment following violations of “ceremonial rules” (social norms or conventional
behavior), which is less often seen among family and friends than among strangers and
new acquaintances. Similarly, [22] emphasized that the smile is a kind of hedge, sig-
naling awareness of a social norm being violated and serving to provoke forgiveness
from the interlocutor. Overlap in this context may be an index of the high coordination
that characterizes conversation among friends whereby simultaneous speech indicates
comfort, or that same overlap may indicate the lack of coordination that characterizes
strangers who have not yet entrained to one another’s speech patterns [5]. Our findings
provide further empirical support for this body of prior work.
Another important contingent pattern of behaviors discussed here is the interaction
between smiles and backchannels [See SH2, FL3]. In general, a backchannel accompanied by a smile was indicative of high rapport, perhaps because it signaled that the listener was inviting a continuation of the speaker's turn while also expressing appreciation of the interlocutor's speech [3].
We also discovered an interaction among smiles, the conversational strategy of self-disclosure, and overlaps [See SH1]. Smiles invite self-disclosure, after which an overlap demonstrates the responsiveness of the interlocutor. [25] have shown that partner responsiveness is a significant component of the intimacy process that benefits rapport. Finally,
we described how the presence of overlaps with a nonverbal behavior or conversational
strategy often signals high rapport in friends but low rapport in strangers [See SH3, FL3,
SL2, SL3]. Prior work has found that friends are more likely to interrupt than strangers,
and the interruptions are less likely to be seen as disruptive or conflictual [5].
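To make patterns of this general shape concrete, the following sketch checks one smile/norm-violation/overlap rule against a time-stamped event stream. The event labels, the co-occurrence window, and the rapport mapping are assumptions chosen for illustration, not the learned Titarl rules themselves.

```python
def rule_fires(events, window=5.0):
    """True if a smile co-occurs with a norm violation (within `window`
    seconds) and a speech overlap follows the violation."""
    violations = [t for name, t in events if name == "norm_violation"]
    smiles = [t for name, t in events if name == "smile"]
    overlaps = [t for name, t in events if name == "overlap"]
    for v in violations:
        if any(abs(s - v) <= window for s in smiles):
            if any(o > v for o in overlaps):
                return True
    return False

def predicted_rapport(events, relationship):
    """Per the discussion above: the same pattern signals high rapport
    among friends but low rapport among strangers."""
    if rule_fires(events):
        return "high" if relationship == "friends" else "low"
    return "unknown"

stream = [("norm_violation", 12.0), ("smile", 13.5), ("overlap", 15.2)]
print(predicted_rapport(stream, "friends"))    # friends + pattern -> high
print(predicted_rapport(stream, "strangers"))  # strangers + pattern -> low
```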
6.3 Implications for Social Agent Design
Rules such as those presented above can play a fundamental role in building socially-
aware agents that adapt to the rapport level felt by their users in ways that previous
work has not addressed. For example, [12] extracted a set of hand-crafted rules based
on social science literature to build a rapport agent. Such rules not only need expert
knowledge to craft, but may also be hard to scale up and to transfer to different domains.
In our current work, we alleviate this problem by automatically extracting behavioral
rules that signal high or low rapport, learning on verbal and nonverbal annotations of a
particular corpus, but employing only the annotations of conversational strategies that
did not concern the content domain of the corpus. This also represents an advance on
work by [19] that improved rapport through nonverbal and para-verbal channels, but
did not take linguistic information or temporal co-occurrence across modalities into account. We included linguistic information in our rules, and in other work we have shown that the linguistic information (conversational strategies) that formed an essential part of the temporal rules presented here can be automatically recognized [39]. Similarly, [9]’s
gaze-reactive pedagogical agent diagnoses disengagement or boredom by the use of eye
trackers. However, only taking eye gaze into account forfeits the potential synergistic
effect of interaction across modalities.
As noted above, while our current work focused on developing an interpretable and
explanatory model of temporal behaviors to serve as a building block for our rapport-
aligned peer-tutoring system (RAPT), the framework can be applied for prediction of
other social phenomena of interest in virtual agent systems (such as trust and intimacy),
in domains as diverse as survey interviewing, sales, and health.
7 Conclusion
In this work, we utilized a temporal association rule framework for automatic discovery
of co-occurring and contingent behavior patterns that precede high and low interper-
sonal rapport in dyads of friends and strangers. Our work provides insights for bet-
ter understanding of dyadic multimodal behavior sequences and their relationship with
rapport which, in turn, moves us forward towards the implementation of socially-aware
agents of all kinds - including “socially-skilled” virtual peer tutors that can assess the
state of a relationship with a student, sigh in frustrated solidarity about a learning task
at hand, and know how to respond to maximize learning in the peer tutoring context.
Among the patterns our rules discovered were the interaction of smiles and backchan-
nels in signaling mutual attention and appreciation, and the pattern of self-disclosure,
followed or preceded by smiles and speech overlap, as an indicator of high rapport. We
found smiles to be one way in which interlocutors appear to mitigate the face-threat of
social norm violations such as insults. However, our experiments discovered that while
the presence of speech overlaps with smiles and social norm violations in friends signals
high rapport, the presence of speech overlaps with social norm violations in strangers
signals low rapport. In addition, for prediction of rapport, we observed the benefits (sig-
nificantly lower mean square prediction error) of constructing predictor variables that
work on fine-grained representation of social behaviors, explicitly model the temporal
relations among them, and encode ordering as well as timing, over using crudely aggregated behavioral descriptors in a baseline linear regression model.

Limitations of the current work include our focus on rapport states; in future work
we will also want to find the temporal association rules that lead to a delta in rapport.
In addition, while the current work discovers those behaviors that directly precede a
rapport state, we have not yet verified that the link is causal. In service to that goal, our
current work has implemented the temporal association rules as a real-time module, and
has integrated them into a working virtual agent system. Our future work will use this
system to evaluate the causal nature of these rules, and their effect on human - virtual
agent interaction.
References
1. Jens Allwood and Loredana Cerrato. A study of gestural feedback expressions. In First
nordic symposium on multimodal communication, pages 7–22. Copenhagen, 2003.
2. Nalini Ambady and Robert Rosenthal. Thin slices of expressive behavior as predictors of
interpersonal consequences: A meta-analysis. Psychological bulletin, 111(2):256, 1992.
3. Elisabetta Bevacqua, Maurizio Mancini, and Catherine Pelachaud. A listening agent exhibit-
ing variable behaviour. pages 262–269, 2008.
4. Joseph N Cappella. On defining conversational coordination and rapport. Psychological
Inquiry, 1(4):303–305, 1990.
5. Justine Cassell, Alastair J Gill, and Paul A Tepper. Coordination in conversation and rapport.
In Proceedings of the Workshop on Embodied Language Processing, pages 41–50. ACL, 2007.
6. Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. Predicting co-verbal ges-
tures: A deep and temporal modeling approach. pages 152–166, 2015.
7. Mathieu Chollet, Magalie Ochs, and Catherine Pelachaud. From non-verbal signals sequence
mining to bayesian networks for interpersonal attitudes expression. In International Confer-
ence on Intelligent Virtual Agents, pages 120–133. Springer, 2014.
8. Chih-Yueh Chou and Tak-Wai Chan. Reciprocal tutoring: Design with cognitive load shar-
ing. International Journal of Artificial Intelligence in Education, pages 1–24, 2015.
9. Sidney D’Mello, Andrew Olney, Claire Williams, and Patrick Hays. Gaze tutor: A gaze-
reactive intelligent tutoring system. International Journal of Human-Computer Studies,
70(5):377–398, 2012.
10. Donna T Fujimoto. Listener responses in interaction: A case for abandoning the term,
backchannel. 2009.
11. Erving Goffman. Interaction ritual: Essays in face to face behavior. AldineTransaction,
12. Jonathan Gratch, Anna Okhmatovskaia, Francois Lamothe, Stacy Marsella, Mathieu Morales, Rick J van der Werf, and Louis-Philippe Morency. Virtual rapport. In Intelligent Virtual Agents, pages 14–27. Springer, 2006.
13. Jonathan Gratch, Ning Wang, Jillian Gerten, Edward Fast, and Robin Duffy. Creating rapport
with virtual agents. In Intelligent Virtual Agents, pages 125–138. Springer, 2007.
14. Agustín Gravano and Julia Hirschberg. Backchannel-inviting cues in task-oriented dialogue.
In INTERSPEECH, pages 1019–1022, 2009.
15. Mathieu Guillame-Bert and James L. Crowley. Learning temporal association rules on sym-
bolic time sequences. pages 159–174, 2012.
16. Mathieu Guillame-Bert and Artur Dubrawski. Learning temporal rules to forecast events in
multivariate time sequences.
17. Dirk Heylen, Elisabetta Bevacqua, Marion Tellier, and Catherine Pelachaud. Searching for
prototypical facial feedback signals. In Intelligent Virtual Agents, pages 147–153. Springer, 2007.
18. Lixing Huang, Louis-Philippe Morency, and Jonathan Gratch. Virtual rapport 2.0. In Intel-
ligent Virtual Agents, pages 68–79. Springer, 2011.
19. Lixing Huang, Louis-Philippe Morency, and Jonathan Gratch. Virtual rapport 2.0. pages
68–79, 2011.
20. David W Johnson. Student-student interaction: The neglected variable in education. Educa-
tional researcher, 10(1):5–10, 1981.
21. Sin-Hwa Kang, Jonathan Gratch, Candy Sidner, Ron Artstein, Lixing Huang, and Louis-
Philippe Morency. Towards building a virtual counselor: modeling nonverbal behavior during intimate self-disclosure. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS) - Volume 1, pages 63–70, 2012.
22. Dacher Keltner and Brenda N Buswell. Embarrassment: its distinct form and appeasement
functions. Psychological bulletin, 122(3):250, 1997.
23. Adena M Klem and James P Connell. Relationships matter: Linking teacher support to
student engagement and achievement. Journal of school health, 74(7):262–273, 2004.
24. Karel Kreijns, Paul A Kirschner, and Wim Jochems. Identifying the pitfalls for social inter-
action in computer-supported collaborative learning environments: a review of the research.
Computers in human behavior, 19(3):335–353, 2003.
25. Jean-Philippe Laurenceau, Lisa Feldman Barrett, and Paula R Pietromonaco. Intimacy as
an interpersonal process: the importance of self-disclosure, partner disclosure, and perceived
partner responsiveness in interpersonal exchanges. Journal of personality and social psy-
chology, 74(5):1238, 1998.
26. Yoichi Matsuyama, Arjun Bhardwaj, Ran Zhao, Oscar J. Romero, Sushma Akoju, and Justine
Cassell. Socially-aware animated intelligent personal assistant agent. In 17th Annual SIGdial
Meeting on Discourse and Dialogue, 2016.
27. Yukiko I Nakano, Sakiko Nihonyanagi, Yutaka Takase, Yuki Hayashi, and Shogo Okada.
Predicting participation styles using co-occurrence patterns of nonverbal behaviors in col-
laborative learning. pages 91–98, 2015.
28. Amy Ogan, Samantha Finkelstein, Erin Walker, Ryan Carlson, and Justine Cassell. Rudeness
and rapport: Insults and learning gains in peer tutoring. In Intelligent Tutoring Systems, pages
11–21. Springer, 2012.
29. Vikram Ramanarayanan, Chee Wee Leong, Lei Chen, Gary Feng, and David Suendermann-
Oeft. Evaluating speech, face, emotion and body movement time-series features for auto-
mated multimodal presentation scoring. pages 23–30, 2015.
30. Nikol Rummel, Erin Walker, and Vincent Aleven. Different futures of adaptive collaborative
learning support. International Journal of Artificial Intelligence in Education, 26(2):784–
795, 2016.
31. Anna M Sharpley, James W Irvine, and Christopher F Sharpley. An examination of the
effectiveness of a cross-age tutoring program in mathematics for elementary school children.
American Educational Research Journal, 20(1):103–111, 1983.
32. Tanmay Sinha and Justine Cassell. Fine-grained analyses of interpersonal processes and their
effect on learning. In Artificial Intelligence in Education, pages 781–785. Springer, 2015.
33. Tanmay Sinha and Justine Cassell. We click, we align, we learn: Impact of influence and
convergence processes on student learning and rapport building. In Proceedings of the 2015
Workshop on Modeling Interpersonal Synchrony, 17th ACM International Conference on
Multimodal Interaction. ACM, 2015.
34. Tanmay Sinha, Ran Zhao, and Justine Cassell. Exploring socio-cognitive effects of con-
versational strategy congruence in peer tutoring. In Proceedings of the 2015 Workshop on
Modeling Interpersonal Synchrony, 17th ACM International Conference on Multimodal In-
teraction. ACM, 2015.
35. Linda Tickle-Degnen and Robert Rosenthal. The nature of rapport and its nonverbal corre-
lates. Psychological inquiry, 1(4):285–293, 1990.
36. Torsten Wörtwein, Mathieu Chollet, Boris Schauerte, Louis-Philippe Morency, Rainer
Stiefelhagen, and Stefan Scherer. Multimodal public speaking performance assessment.
pages 43–50, 2015.
37. Zhou Yu, David Gerritsen, Amy Ogan, Alan W Black, and Justine Cassell. Automatic pre-
diction of friendship via multi-model dyadic features. In 14th Annual SIGdial Meeting on
Discourse and Dialogue, Metz, France, 2013.
38. Ran Zhao, Alexandros Papangelis, and Justine Cassell. Towards a dyadic computational
model of rapport management for human-virtual agent interaction. In Intelligent Virtual
Agents, pages 514–527. Springer, 2014.
39. Ran Zhao, Tanmay Sinha, Alan Black, and Justine Cassell. Automatic recognition of conver-
sational strategies in the service of a socially-aware dialog system. In 17th Annual SIGDIAL
Meeting on Discourse and Dialogue, 2016.
... Research in dyadic and small group interactions has enabled the development of automatic approaches for detection, understanding, modeling, and synthesis of individual and interpersonal behaviors, social signals, and dynamics (Picard, 2000;Gatica-Perez, 2009;Vinciarelli et al., 2011Vinciarelli et al., , 2015. For measuring interpersonal processes during an interaction such as non-verbal synchrony (Delaherche et al., 2012), rapport (Zhao et al., 2016), or engagement (Dermouche and Pelachaud, 2019), the joint modeling of all interlocutors and/or other sources of context has been frequently considered. These sources of context may include individual factors, such as sociodemographics and other attributes of each interlocutor, or shared factors, such as the history of the interaction and characteristics of the situation (Rauthmann et al., 2014). ...
Full-text available
This paper summarizes the 2021 ChaLearn Looking at People Challenge on Understanding Social Behavior in Dyadic and Small Group Interactions (DYAD), which featured two tracks, self-reported personality recognition and behavior forecasting, both on the UDIVA v0.5 dataset. We review important aspects of this multimodal and multiview dataset consisting of 145 interaction sessions where 134 participants converse, collaborate, and © 2022 C. Palmero et al. ChaLearn LAP Challenges on Personality Recognition and Behavior Forecasting compete in a series of dyadic tasks. We also detail the transcripts and body landmark annotations for UDIVA v0.5 that are newly introduced for this occasion. We briefly comment on organizational aspects of the challenge before describing each track and presenting the proposed baselines. The results obtained by the participants are extensively analyzed to bring interesting insights about the tracks tasks and the nature of the dataset. We wrap up with a discussion on challenge outcomes, and pose several questions that we expect will motivate further scientific research to better understand social cues in human-human and human-machine interaction scenarios and help build future AI applications for good.
... Context can take many forms, from the interaction partner's attributes and behaviors to spatio-temporal and multi-view information. Joint modeling of both interlocutors and/or other sources of context have been extensively considered when trying to measure interpersonal constructs [13,82], individual social behaviors [14,81] and impressions [81,59], and even empathy [59]. When considering individual attributes instead, context has often been misrepresented, in spite of extensive claims on its importance [4,74,70,56]. ...
Full-text available
Personality computing has become an emerging topic in computer vision, due to the wide range of applications it can be used for. However, most works on the topic have focused on analyzing the individual, even when applied to interaction scenarios, and for short periods of time. To address these limitations, we present the Dyadformer, a novel multi-modal multi-subject Transformer architecture to model individual and interpersonal features in dyadic interactions using variable time windows, thus allowing the capture of long-term interdependencies. Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations. This proof-of-concept approach shows how multi-modality and joint modeling of both interactants for longer periods of time helps to predict individual attributes. With Dyadformer, we improve state-of-the-art self-reported personality inference results on individual subjects on the UDIVA v0.5 dataset.
... Rapport is the scaffold of social engagement and researchers have sought to establish and maintain it in human-agent social interactions using several mechanisms (e.g., back-channeling [34], gesture mimicry and emotional alignment [18] or behavioral patterns [46]). An interpersonal process that impacts rapport directly is Social Sharing of Emotions (SSE) [29]. ...
Conference Paper
Full-text available
Social sharing of emotions (SSE) occurs when one communicates their feelings and reactions to a certain event in the course of a social interaction. The phenomenon is part of our social fabric and plays an important role in creating empathetic responses and establishing rapport. Intelligent social agents capable of SSE will have a mechanism to create and build long-term interaction with humans. In this paper, we present the Emotional Episode Generation (EEG) model, a fine-tuned GPT-2 model capable of generating emotional social talk regarding multiple event tuples in a human-like manner. Human evaluation results show that the model successfully translates one or more event-tuples into emotional episodes, reaching quality levels close to human performance. Furthermore, the model clearly expresses one emotion in each episode as well as humans. To train this model we used a public dataset and built upon it using event extraction techniques 1 .
... They have been deployed in various human-machine interactions where they can act as a tutor (Mills et al., 2019), health support (Lisetti et al., 2013;Rizzo et al., 2016;Zhang et al., 2017), a companion (Sidner et al., 2018), a museum guide (Kopp et al., 2005;Swartout et al., 2010), etc. Studies have reported that ECAs are able to take into account their human interlocutors and show empathy (Paiva et al., 2017), display backchannels (Bevacqua et al., 2008), and build rapport (Huang et al., 2011;Zhao et al., 2016). Given its relevance in human-human interaction, adaptation could be exploited to improve natural interactions with ECAs. ...
Full-text available
Adaptation is a key mechanism in human–human interaction. In our work, we aim at endowing embodied conversational agents with the ability to adapt their behavior when interacting with a human interlocutor. With the goal to better understand what the main challenges concerning adaptive agents are, we investigated the effects on the user’s experience of three adaptation models for a virtual agent. The adaptation mechanisms performed by the agent take into account the user’s reaction and learn how to adapt on the fly during the interaction. The agent’s adaptation is realized at several levels (i.e., at the behavioral, conversational, and signal levels) and focuses on improving the user’s experience along different dimensions (i.e., the user’s impressions and engagement). In our first two studies, we aim to learn the agent’s multimodal behaviors and conversational strategies to dynamically optimize the user’s engagement and impressions of the agent, by taking them as input during the learning process. In our third study, our model takes both the user’s and the agent’s past behavior as input and predicts the agent’s next behavior. Our adaptation models have been evaluated through experimental studies sharing the same interacting scenario, with the agent playing the role of a virtual museum guide. These studies showed the impact of the adaptation mechanisms on the user’s experience of the interaction and their perception of the agent. Interacting with an adaptive agent vs. a nonadaptive agent tended to be more positively perceived. Finally, the effects of people’s a priori about virtual agents found in our studies highlight the importance of taking into account the user’s expectancies in human–agent interaction.
... From a computational perspective, spatiotemporal and multiview information can be referred to as context as well. For the measurement of interpersonal constructs (e.g., synchrony [29], rapport [30]), individual social behaviors (e.g., engagement [31]) and impressions (e.g., dominance [32], empathy [15]), the joint modeling of both interlocutors and/or other sources of context has been frequently considered. However, for the task of recognizing individual attributes such as emotion and personality, context has often been misrepresented, despite recurrent claims on its importance [33,34,35,36]. ...
... Studies have shown that deeper and more meaningful learning occurs in social contexts rather than when working alone . To take advantage of this finding, it is imperative to create a virtual tutor that is sufficiently realistic and relatable so that the student engages with it in a social manner, constructing knowledge collaboratively as with a human peer or tutor and building rapport with the user (Zhao et al. 2016). As discussed previously, it is likewise important to ensure that the virtual tutor does not impede or disrupt learning, for example interrupting with unwanted hints, as this can negate any positive effects the presence of a virtual tutor may have, and ultimately makes the educational experience much less enjoyable (Conati and Manske 2009). ...
Prior to developing the Social Tutor software discussed in the remainder of this book, an investigation into existing social skills interventions, both traditional and technology-based, was conducted. Technology-based interventions include hardware and software, with some incorporating virtual and augmented reality. Here, a brief overview of some of the more influential and novel interventions are given.
... Studies have shown that deeper and more meaningful learning occurs in social contexts rather than when working alone . To take advantage of this finding, it is imperative to create a virtual tutor that is sufficiently realistic and relatable so that the student engages with it in a social manner, constructing knowledge collaboratively as with a human peer or tutor and building rapport with the user (Zhao et al. 2016). As discussed previously, it is likewise important to ensure that the virtual tutor does not impede or disrupt learning, for example interrupting with unwanted hints, as this can negate any positive effects the presence of a virtual tutor may have, and ultimately makes the educational experience much less enjoyable (Conati and Manske 2009). ...
This book highlights current research into virtual tutoring software and presents a case study of the design and application of a social tutor for children with autism. Best practice guidelines for developing software-based educational interventions are discussed, with a major emphasis on facilitating the generalisation of skills to contexts outside of the software itself, and on maintaining these skills over time. Further, the book presents the software solution Thinking Head Whiteboard, which provides a framework for families and educators to create unique educational activities utilising virtual character technology and customised to match learners’ needs and interests. In turn, the book describes the development and evaluation of a social tutor incorporating multiple life-like virtual humans, leading to an exploration of the lessons learned and recommendations for the future development of related technologies.
... Studies have shown that deeper and more meaningful learning occurs in social contexts rather than when working alone . To take advantage of this finding, it is imperative to create a virtual tutor that is sufficiently realistic and relatable so that the student engages with it in a social manner, constructing knowledge collaboratively as with a human peer or tutor and building rapport with the user (Zhao et al. 2016). As discussed previously, it is likewise important to ensure that the virtual tutor does not impede or disrupt learning, for example interrupting with unwanted hints, as this can negate any positive effects the presence of a virtual tutor may have, and ultimately makes the educational experience much less enjoyable (Conati and Manske 2009). ...
When designing a tutoring system for any population, the needs and characteristics of the specific audience need to be carefully considered. In this chapter we discuss some of the challenges individuals with specific needs may experience and suggest strategies to support these users and lead to positive educational outcomes.
... Studies have shown that deeper and more meaningful learning occurs in social contexts rather than when working alone . To take advantage of this finding, it is imperative to create a virtual tutor that is sufficiently realistic and relatable so that the student engages with it in a social manner, constructing knowledge collaboratively as with a human peer or tutor and building rapport with the user (Zhao et al. 2016). As discussed previously, it is likewise important to ensure that the virtual tutor does not impede or disrupt learning, for example interrupting with unwanted hints, as this can negate any positive effects the presence of a virtual tutor may have, and ultimately makes the educational experience much less enjoyable (Conati and Manske 2009). ...
Many factors go into creating an ECA that is engaging, appropriate for the intended purpose and educationally beneficial. The persona of the ECA is a major consideration and covers not only the appearance and voice of the ECA but also their responses and mannerisms. Another factor is perceived availability versus interference of the ECA—should it be a constant presence or just appear when needed? In which case, how do we know whether it is appearing often enough or too often, like the infamous Microsoft Clippit (Picard 2004)? Clearly, a range of issues must be considered, and what works for one learner does not necessarily work for everyone, however existing research in psychology and human–computer interaction can provide us with some guidelines.
Conference Paper
Full-text available
In this work, we focus on automatically recognizing social conversational strategies that in human conversation contribute to building, maintaining or sometimes destroying a budding relationship. These conversational strategies include self-disclosure, reference to shared experience , praise and violation of social norms. By including rich contextual features drawn from verbal, visual and vocal modalities of the speaker and interlocutor in the current and previous turn, we can successfully recognize these dialog phenomena with an accuracy of over 80% and kappa ranging from 60-80%. Our findings have been successfully integrated into an end-to-end socially aware dialog system, with implications for virtual agents that can use rapport between user and system to improve task-oriented assistance.
Conference Paper
Full-text available
We analyze how fusing features obtained from different multimodal data streams such as speech, face, body movement and emotion tracks can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series based features from these data streams--the former being statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of cooccurrences, capture how different prototypical body posture or facial configurations co-occur within different time-lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech stream features in predicting human-rated scores of multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
Conference Paper
Full-text available
The ability to speak proficiently in public is essential for many professions and in everyday life. Public speaking skills are difficult to master and require extensive training. Recent developments in technology enable new approaches for public speaking training that allow users to practice in engaging and interactive environments. Here, we focus on the automatic assessment of nonverbal behavior and multimodal modeling of public speaking behavior. We automatically identify audiovisual nonverbal behaviors that are correlated to expert judges' opinions of key performance aspects. These automatic assessments enable a virtual audience to provide feedback that is essential for training during a public speaking performance. We utilize multimodal ensemble tree learners to automatically approximate expert judges' evaluations to provide post-hoc performance assessments to the speakers. Our automatic performance evaluation is highly correlated with the experts' opinions with r = 0.745 for the overall performance assessments. We compare multimodal approaches with single modalities and find that the multimodal ensembles consistently outperform single modalities.
With the goal of assessing participant attitudes and group activities in collaborative learning, this study presents models of participation styles based on co-occurrence patterns of nonverbal behaviors between conversational participants. First, we collected conversations among groups of three people in a collaborative learning situation, wherein each participant had a digital pen and wore a glasses-type eye tracker. We then divided the collected multimodal data into 0.1-second intervals. The discretized data were applied to an unsupervised method to find co-occurring behavioral patterns. As a result, we discovered 122 multimodal behavioral motifs from more than 3,000 possible combinations of behaviors by three participants. Using the multimodal behavioral motifs as predictor variables, we created regression models for assessing participation styles. The multiple correlation coefficients ranged from 0.74 to 0.84, indicating a good fit between the models and the data. A correlation analysis also enabled us to identify a smaller set of behavioral motifs (fewer than 30) that are statistically significant predictors of participation styles. These results show that automatically discovered combinations of nonverbal cues with high co-occurrence frequencies, observed both across multiple participants and within a single participant, are useful for characterizing participants' attitudes towards the conversation.
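The motif-discovery step can be sketched as follows (a minimal illustration under our own assumptions; the behavior labels, function name, and support threshold are invented): represent each 0.1-second interval as a tuple of simultaneous behaviors, one slot per participant, and keep the combinations that recur often enough to count as motifs.

```python
from collections import Counter

def find_behavior_motifs(frames, min_support):
    """frames: per-interval tuples of simultaneous behaviors, one slot
    per participant, e.g. ('gaze_partner', 'writing', 'gaze_object').
    Returns the combinations frequent enough to count as motifs."""
    counts = Counter(frames)
    return {combo: n for combo, n in counts.items() if n >= min_support}

# Toy discretized stream for a triad (invented labels)
frames = ([('gaze_p', 'write', 'gaze_o')] * 5
          + [('gaze_o', 'gaze_o', 'write')] * 2)
motifs = find_behavior_motifs(frames, min_support=3)
```

In the study, such frequent combinations then served as predictor variables in the regression models of participation style.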
Rapport has been identified as an important function of human interaction, but to our knowledge no model yet exists for building and maintaining rapport between humans and conversational agents over the long term that operates at the level of the dyad. In this paper we leverage the existing literature and a corpus of peer tutoring data to develop a framework that explains how humans in dyadic interactions build, maintain, and destroy rapport through the use of specific conversational strategies that fulfill specific social goals and that are instantiated in particular verbal and nonverbal behaviors. We demonstrate its functionality using examples from our experimental data.
In this position paper we contrast a Dystopian view of the future of adaptive collaborative learning support (ACLS) with a Utopian scenario that – due to better-designed technology, grounded in research – avoids the pitfalls of the Dystopian version and paints a positive picture of the practice of computer-supported collaborative learning 25 years from now. We discuss research that we see as important in working towards a Utopian future in the next 25 years. In particular, we see a need to work towards a comprehensive instructional framework building on educational theory. This framework will allow us to provide nuanced and flexible (i.e. intelligent) ACLS to collaborative learners – the type of support we sketch in our Utopian scenario.
We introduce a temporal pattern model called Temporal Interval Tree Association Rules (Tita rules, or Titar). This pattern model can express both the uncertainty and the temporal inaccuracy of temporal events. Among other things, Tita rules can express the usual time-point operators, synchronicity, order, and chaining, as well as temporal negation and disjunctive temporal constraints. Using this representation, we present the Titar learner algorithm, which can be used to extract Tita rules from large datasets expressed as Symbolic Time Sequences. The selection of temporal constraints (or time-frames) is at the core of the temporal learning. Our learning algorithm is based on two novel approaches to this problem: the first selects temporal constraints for the head of temporal association rules; the second selects temporal constraints for the body of such rules. We discuss the evaluation of probabilistic temporal association rules, evaluate our technique in two experiments, introduce a metric for evaluating sets of temporal rules, compare the results with two other approaches, and discuss the results.
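How a learned temporal rule is applied can be sketched as follows (our own simplified illustration, not the Titar algorithm itself; the event symbols, timestamps, and function name are invented): a rule with several antecedent symbols and a time window fires at time t when every antecedent has an occurrence inside [t - window, t].

```python
def match_rule(events, antecedents, window):
    """events: dict mapping event symbol -> sorted list of timestamps.
    Return the times t at which every antecedent symbol has an
    occurrence inside the interval [t - window, t]."""
    candidate_times = sorted({t for s in antecedents
                              for t in events.get(s, [])})
    return [t for t in candidate_times
            if all(any(t - window <= u <= t for u in events.get(s, []))
                   for s in antecedents)]

# Toy annotated event streams (invented symbols)
events = {'smile': [1.0, 5.0], 'gaze': [1.5, 9.0]}
hits = match_rule(events, ['smile', 'gaze'], window=1.0)
```

At each firing time the rule would then predict its head event within the rule's predicted interval; Titar additionally learns which windows to use and scores rules probabilistically.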
Reciprocal tutoring, as reported in “Exploring the design of computer supports for reciprocal tutoring” (Chan and Chou 1997), has extended the meaning and scope of intelligent tutoring originally implemented in standalone computers. This research is a follow-up to our studies on a learning companion system in the late 1980s and its network version, Distributed West, in the early 1990s. In this commentary paper, we first provide the history of and rationale behind our research. We pose and discuss six design dimensions that comprise 12 design questions. This is done on the basis of our previous experience and current knowledge as well as by reexamining the design approach, cognitive load sharing, in the original paper. Our purpose is to shed light on the future design of reciprocal tutoring. One-to-one classrooms, in which students learn with their personal computing devices (Chan et al. 2006), are becoming prevalent in practice; therefore, we expect that reciprocal tutoring—learning-by-tutoring and learning-by-being-tutored—will also become widespread.