How Believable Are Real Faces? Towards a Perceptual Basis for Conversational
Animation
Douglas W. Cunningham, Martin Breidt, Mario Kleiner, Christian Wallraven, Heinrich H. Bülthoff
Max Planck Institute for Biological Cybernetics
72076 Tübingen, Germany
{first.lastname}@tuebingen.mpg.de
Abstract
Regardless of whether the humans involved are virtual or
real, well-developed conversational skills are a necessity.
The synthesis of interface agents that are not only
understandable but also believable can be greatly aided by
knowledge of which facial motions are perceptually necessary and
sufficient for clear and believable conversational facial
expressions. Here, we recorded several core conversational
expressions (agreement, disagreement, happiness, sadness,
thinking, and confusion) from several individuals, and then
psychophysically determined the perceptual ambiguity and
believability of the expressions. The results show that peo-
ple can identify these expressions quite well, although there
are some systematic patterns of confusion. People were also
very confident of their identifications and found the expres-
sions to be rather believable. The specific pattern of con-
fusions and confidence ratings have strong implications for
conversational animation. Finally, the present results pro-
vide the information necessary to begin a more fine-grained
analysis of the core components of these expressions.
Keywords: perceptual models, social and conversational
agents, virtual humans and avatars, behavioral animation,
vision techniques in animation
1 Introduction
An astonishing variety of face and hand motions occur dur-
ing the course of a conversation. Many of these non-verbal
behaviors are central to either the flow or the meaning of the
conversation. For example, speech is often accompanied by
a variety of facial motions which modify the meaning of
what is said, e.g. [1, 2, 3, 4, 5]. Likewise, when producing
certain forms of vocal emphasis (e.g., like one would for
the word “tall” in the sentence: “No, I meant the tall duck”)
the face moves to reflect this emphasis. Indeed, it can be
exceedingly difficult to produce the proper vocal stress pat-
terns without producing the accompanying facial motion.
These facial motions are often so tightly integrated with the
spoken message that it has been argued that the visual and
auditory signals should be treated as a unified whole within
linguistics and not as separate entities [2]. Indeed, in many
instances it is not clear if the visual and the auditory acts
can be separated without significantly altering the intended
meaning. For example, a spoken statement of extreme surprise
has a rather different meaning when accompanied by a
neutral facial expression than when accompanied by a sur-
prised expression.
Non-verbal behaviors can also be used to control the
course of a conversation [6, 7, 8, 9, 10]. Cassell and col-
leagues [8, 9], for example, have created agents that utilize
head motion and eye gaze to help control the flow of the
conversation (i.e., to help control turn-taking). Even more
subtle control of the conversation is possible through the
use of back-channel responses, e.g. [11, 12]. For exam-
ple, when a listener nods in agreement, the speaker knows
that they were understood and can continue. A look of con-
fusion, on the other hand, signals that the speaker should
probably stop and explain the last point in more detail. A
look of disgust would signal that a change of topics might
be warranted, and so on.
Knowing which expressions people use during a conver-
sation is not, however, sufficient for fully accurate conver-
sational facial animation. There are many ways to incor-
rectly produce an expression, and attempts to synthesize an
expression without knowing exactly what components are
perceptually required will lead to miscommunication. Of
course, one could try to circumvent this problem by per-
fectly duplicating real facial expressions. This will not,
however, fully alleviate the problem since facial expressions
are not always unambiguous. In other words, even within
the proper conversational context, we are not always 100%
accurate in determining what the expressions of a conver-
sational partner are supposed to mean. Moreover, in some
instances a more abstract, cartoon-like, or even non-human
embodied agent may be preferred [13], in which case full
duplication of all the physical characteristics of human ex-
pressions may not be possible. In both cases, knowledge
of how humans perceive conversational expressions would
be very helpful. What facial motions or features distinguish
one expression from another? What makes a given instance
of an expression easier to identify than another instance of
the same expression?

Figure 1: Sketch of the 6 camera layout. (Diagram labels: Subject, Light, Channellink-Videobus, Sync-Cable, Master-PC, Slave-PC, LAN - Ethernet, Fileserver.)

While a fair amount is known about the production and
perception of the "universal expressions" (these are happi-
ness, sadness, fear, anger, disgust, contempt, and surprise
according to Ekman [14]), considerably less is known about
the non-affective expressions which arise during a conver-
sation. Several points regarding conversational expressions
are, however, already clear. First, humans often produce
expressions during the course of normal conversations that are
misunderstood, leading to miscommunication. Second, it
is possible to produce an expression that is correctly recog-
nized, but is perceived as contrived or insincere. In other
words, realism is not the same thing as clarity and be-
lievability. As interface agents become more capable and
progress into more business critical operations (e.g., virtual
sales agents), the believability of the agent will most likely
become a very critical issue. Who would buy anything from
an agent if it is obviously lying or insincere, regardless of
how good or realistic it looks?
In order to maximize the clarity, believability, and
efficiency of conversational agents, it would be highly
advantageous to know what the necessary and sufficient
components of various facial expressions are. Here, we lay
the groundwork for such a detailed examination by first
determining how distinguishable and believable six core
conversational facial expressions are (agreement, disagree-
ment, happiness, sadness, thinking, and confusion). Using
a recording setup consisting of 6 synchronized digital
cameras (described in Section 2), we recorded 6 individuals
performing these conversational expressions (described in
Section 3). The results (described in more detail in Section 4)
show that people can identify these expressions quite well,
although there are some systematic patterns of confusion.
The specific patterns of confusion indicate several poten-
tial problem areas for conversational animation. The results
also show that people were very confident of their
identifications, even when they misidentified an expression. Finally,
the expressions were not all fully convincing; the
believability of expressions varied considerably across
individuals. Having identified expressions which differ considerably
in believability and ambiguity, and having recorded the
expressions from multiple viewpoints with tracking markers
placed on the faces, additional experiments will be able to
provide a more detailed description of exactly which types
of facial motion led to the confusions, and which types of
information led to the clearest comprehension of the intended
message (see Section 5).
2 Recording Equipment
To record the facial expressions, a custom camera rig was
built using a distributed recording system with six recording
units, each of which consists of a digital video camera, a
frame grabber, and a host computer.
Each unit can record up to 60 frames/sec of fully syn-
chronized non-interlaced, uncompressed video in PAL res-
olution (768 x 576), which is stored in raw CCD format.
The six cameras were arranged in a semi-circle around
the subject (see Figure 1) at a distance of approximately
1.5m. The individuals were filmed with 30 frames/s and an
exposure time of 3 ms in order to reduce motion blur.
To facilitate later processing of the images, care was
taken to light the actors’ faces as flat as possible to avoid
directional lighting effects (cast shadows, highlights). For a
more detailed description of the recording setup, see [15].
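As a rough plausibility check on the storage demands of such a setup, the sketch below estimates the raw data rate from the figures given above. The value of one byte per pixel is an assumption (the text only states that the video is stored in raw CCD format), so the result should be read as an order-of-magnitude estimate rather than a specification of the system.

```python
# Rough data-rate estimate for the six-camera recording setup.
WIDTH, HEIGHT = 768, 576   # PAL resolution stated in the text
FPS = 30                   # frame rate used for the recordings
CAMERAS = 6
BYTES_PER_PIXEL = 1        # ASSUMPTION: raw CCD data at 8 bits per pixel


def bytes_per_second(width, height, fps, bytes_per_pixel):
    """Uncompressed video data rate of one camera in bytes per second."""
    return width * height * fps * bytes_per_pixel


per_camera = bytes_per_second(WIDTH, HEIGHT, FPS, BYTES_PER_PIXEL)
total = per_camera * CAMERAS

print(f"per camera:      {per_camera / 1e6:.1f} MB/s")
print(f"all six cameras: {total / 1e6:.1f} MB/s")
```

Under this assumption, each recording unit has to sustain on the order of 13 MB/s, which may be one reason for pairing every camera with its own host computer.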
3 Methodology
Six expressions were recorded from six different people
(three males, three females), yielding 36 video sequences.
The six expressions were agreement, disagreement,
thinking, confusion, pleased, and sadness¹ (see Figure 2). These
particular expressions can play an important role in the
structure and flow of a conversation, and knowledge of how
humans produce and perceive believable versions of these
expressions will be used in the construction of a conversa-
tional agent as part of the IST project COMIC (COnversa-
tional Multimodal Interaction with Computers).
¹ Both the recordings and the experiment were conducted in German.
The exact labels used for the six expressions were Zustimmung, Ablehnung,
Glücklich / Zufrieden, Traurigkeit, Nachdenken, and Verwirrung.

Figure 2: The six expressions. (a) Agreement; (b) Disagreement; (c) Happiness; (d) Sadness; (e) Thinking; (f) Confusion. It is worth noting that the expressions are more difficult to identify in the static than in the dynamic versions.

Figure 3: The many faces of thought. Despite the variability in facial contortion, all of these expressions are recognizable as thoughtful expressions.

For later processing of the recordings (e.g.,
stereo-reconstruction), black tracking markers were applied to the
faces using a specific layout. After marker application, each
actor was centered in front of all six cameras, and he or she
was asked to imagine a situation in which the requested
expression would occur. The “actor” was then filmed with a
neutral face first, followed by the transition to the requested
expression and the return to a neutral face. The actors
were completely unconstrained in the amount and type of
movements that they could produce, with the single exception
that they were asked not to talk during their reaction
unless they felt they absolutely had to. This procedure was
repeated at least three times for each expression. The best
repetition of each expression from each person was
selected and edited so that the video sequence started at the
beginning of the expression and ended just as the expression
started to shift back towards neutral. The length of the
sequences varied considerably. The shortest was 27 frames
long (approximately 0.9 seconds), and the longest was
171 frames (approximately 5.7 seconds). No straightforward
correlation between type of expression and length of
expression was apparent.
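The durations quoted above follow directly from the 30 frames/s recording rate. A minimal sketch of the conversion (the function name is ours, for illustration only):

```python
FPS = 30  # recording and playback rate given in the text


def duration_seconds(n_frames, fps=FPS):
    """Length of a sequence in seconds, given its frame count."""
    return n_frames / fps


shortest = duration_seconds(27)
longest = duration_seconds(171)
print(f"shortest: {shortest:.1f} s, longest: {longest:.1f} s")
```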
Each of the resulting 36 video sequences was shown to
10 different people (hereafter referred to as participants)
in a psychophysical experiment. The primary goal of psy-
chophysics is to systematically examine the functional rela-
tionship between physical dimensions (e.g., light intensity),
and psychological dimensions (e.g., brightness perception).
The work presented here examines the functional relation-
ship between rather high-level dimensions (i.e., patterns of
facial motion and the perception of expressions), and thus
might be more precisely referred to as mid- or high-level
psychophysics.
The sequences were presented at 30 frames/s on a com-
puter. The images subtended a visual angle of about 10 by
7.5 degrees. The order in which the 36 expressions were
presented was completely randomized for each participant.
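A fully randomized presentation order of this kind could be generated as sketched below. The actor identifiers, the seeding scheme, and the function name are assumptions made for illustration; the text does not describe the software that was actually used.

```python
import random
from itertools import product

EXPRESSIONS = ["agreement", "disagreement", "happiness",
               "sadness", "thinking", "confusion"]
ACTORS = [1, 2, 3, 4, 5, 6]  # hypothetical identifiers for the six actors


def trial_order(participant_id, base_seed=0):
    """Return all 36 (actor, expression) sequences in a random order.

    Each participant gets an independent but reproducible shuffle.
    """
    rng = random.Random(base_seed * 1000 + participant_id)
    trials = list(product(ACTORS, EXPRESSIONS))
    rng.shuffle(trials)
    return trials


order = trial_order(participant_id=1)
print(len(order), "trials; first three:", order[:3])
```

Seeding per participant is a design choice that makes each order reproducible for later analysis; an unseeded generator would serve equally well for the experiment itself.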
Participants were given three tasks. While viewing an
expression (which was repeated until the participant re-
sponded, with 200 ms blank between repetitions), the par-
ticipant was first supposed to identify it using a multiple
choice procedure. More specifically, participants identified
an expression by selecting the name of one of the 6
expressions from a list or by selecting “none-of-the-above” to
indicate that the expression was not on the list. Previous
research has shown that performance on this type of task (a
7-alternative, non-forced-choice task) is highly correlated with
other identification tasks (e.g., free naming, where
participants choose any word they want to describe an
expression or the emotion behind an expression; connecting an
expression with a short story; etc.), at least for the “universal
expressions”. See [16] for more information. While these
tasks are very well suited for elucidating the relationship
between an expression (and, to some degree, the intention
behind the expression) and the perception of that expression
(i.e., which facial motions are correlated with clear and
believable facial expressions) when the expression is shown in
isolation (i.e., not enclosed within a series of expressions),
it is less clear that these tasks can be used to determine the
central theme or message in a complex sentence or scene.
For this type of question, an approach similar to that used
by Emiel Krahmer and colleagues (where participants de-
termined in which of two sentences a given word was more
prominent) may be more appropriate; see [17].
Of course, in order to determine what role a single ex-
pression plays within a concatenation of expressions, one
must first determine what information that expression in and
of itself carries. The same is true for examining the role of
context and multimodal expression of meaning: in order to
understand the interactive effects of context and multiple
channels with facial expressions, one must first be certain
that the facial expression used is a clear, and believable ex-
emplar of the intended message. To that end, this present
task should help to determine which facial motions are most
closely correlated with a given intended message, allowing
one to then investigate multimodal interactions.
Immediately after identifying an expression, the partici-
pants were asked to indicate on a 5 point scale exactly how
confident they were about their response. They were told
that a rating of 1 indicated that they were completely
unconfident (i.e., merely guessing) and a rating of 5 that they were
completely confident. Finally, the participants were asked to rate
from 1 (completely fake) to 5 (extremely convincing) how
believable or realistic the expression was.
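One way to picture the data produced by the three tasks is as a single record per trial, as in the following sketch. The class and field names are hypothetical and are not taken from the original experiment software.

```python
from dataclasses import dataclass

EXPRESSIONS = ("agreement", "disagreement", "happiness",
               "sadness", "thinking", "confusion")
RESPONSE_OPTIONS = EXPRESSIONS + ("none-of-the-above",)  # 7 alternatives


@dataclass(frozen=True)
class TrialResponse:
    """One participant's answers for one video sequence (hypothetical layout)."""
    actual: str         # expression the actor was asked to produce
    chosen: str         # task 1: identification response
    confidence: int     # task 2: 1 (merely guessing) to 5 (completely confident)
    believability: int  # task 3: 1 (completely fake) to 5 (extremely convincing)

    def __post_init__(self):
        if self.actual not in EXPRESSIONS:
            raise ValueError(f"unknown expression: {self.actual!r}")
        if self.chosen not in RESPONSE_OPTIONS:
            raise ValueError(f"unknown response: {self.chosen!r}")
        for name in ("confidence", "believability"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def correct(self) -> bool:
        """True when the chosen label matches the intended expression."""
        return self.chosen == self.actual


example = TrialResponse("thinking", "confusion", confidence=4, believability=3)
print(example.correct)
```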
4 Results and Discussion
Overall, participants were quite successful at identifying
the expressions. Table 1 shows a confusion matrix of the
participants’ responses². The pattern of responses for the
“thinking” expressions is particularly interesting:
Participants thought this expression was actually “confusion” 20%
of the time. In many respects this is not too surprising, as
people will often stop and think when they are confused. As
such, thinking and confusion are naturally somewhat inter-
twined. Regardless, such a mistake in a conversation with
an interface agent could well lead to miscommunication and
other problems, as well as decrease the overall efficiency of
the system [18]. For example, a thinking expression might
be used as an activity index. If a user were to mistake
the agent’s thinking expression (indicating that the agent is
busy) for one of confusion (indicating that the agent is
waiting for more information), the user might well attempt to
clear up the perceived confusion - and speak to an already
busy system.
² For each expression, the responses of all ten participants were
collapsed across each of the six “actors”. The resulting frequency histogram
of responses was converted into a percentage.
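The collapsing step described in footnote 2 can be sketched as follows. The response counts in the example are hypothetical, chosen only to be consistent with the “Thinking” row of Table 1 (6 actors x 10 participants = 60 responses per expression); they are not the raw data.

```python
from collections import Counter


def confusion_percentages(responses, actual_labels, response_labels):
    """Convert pooled (actual, chosen) pairs into row-wise percentages."""
    counts = Counter(responses)
    matrix = {}
    for actual in actual_labels:
        row_total = sum(counts[(actual, r)] for r in response_labels)
        matrix[actual] = {
            r: (100.0 * counts[(actual, r)] / row_total) if row_total else 0.0
            for r in response_labels
        }
    return matrix


# HYPOTHETICAL counts for the 60 responses to the "thinking" sequences.
pooled = (
    [("thinking", "thinking")] * 44
    + [("thinking", "confusion")] * 12
    + [("thinking", "disagreement")] * 2
    + [("thinking", "sadness")] * 1
    + [("thinking", "other")] * 1
)
options = ["agreement", "disagreement", "happiness", "sadness",
           "thinking", "confusion", "other"]
matrix = confusion_percentages(pooled, ["thinking"], options)
for label, pct in matrix["thinking"].items():
    print(f"{label:>12}: {pct:5.1f}%")
```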
                   Participants’ responses
Actual expression  Agreement  Disagreement  Happiness  Sadness  Thinking  Confusion  Other
Agreement                95%            0%         2%       0%        0%         0%     3%
Disagreement              0%           85%         2%       7%        0%         3%     3%
Happiness                 7%            2%        73%       3%        0%         5%    10%
Sadness                   0%            3%         0%      82%        5%         2%     8%
Thinking                  0%            3%         0%       2%       73%        20%     2%
Confusion                 0%           18%         0%       2%        5%        73%     2%

Table 1: Confusion matrix of the identification responses. The percentage of the time a given response was chosen
(columns) is shown for each of the six expressions (rows).

Actual expression  Actor 1  Actor 2  Actor 3  Actor 4  Actor 5  Actor 6
Agreement             100%     100%     100%      80%     100%      90%
Disagreement           90%      80%     100%     100%      90%      50%
Happiness              80%      50%     100%      60%      60%      90%
Sadness                40%      80%      90%      80%     100%     100%
Thinking               60%      90%      80%      70%      60%      80%
Confusion              90%      70%      50%      40%      90%     100%

Table 2: Actor accuracy matrix. The percentage of the time a given expression was correctly identified is shown for
each of the six actors.
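Because Table 1 collapses over actors while Table 2 lists each actor separately, the diagonal of Table 1 should equal the row means of Table 2. The following sketch, which uses only the values printed in the two tables, checks that this holds after rounding to whole percentages.

```python
# Per-actor identification accuracy (Table 2), in percent.
TABLE2 = {
    "Agreement":    [100, 100, 100, 80, 100, 90],
    "Disagreement": [90, 80, 100, 100, 90, 50],
    "Happiness":    [80, 50, 100, 60, 60, 90],
    "Sadness":      [40, 80, 90, 80, 100, 100],
    "Thinking":     [60, 90, 80, 70, 60, 80],
    "Confusion":    [90, 70, 50, 40, 90, 100],
}

# Correct-response percentages (diagonal of Table 1).
TABLE1_DIAGONAL = {
    "Agreement": 95, "Disagreement": 85, "Happiness": 73,
    "Sadness": 82, "Thinking": 73, "Confusion": 73,
}

row_means = {expr: sum(scores) / len(scores) for expr, scores in TABLE2.items()}
consistent = {expr: round(mean) == TABLE1_DIAGONAL[expr]
              for expr, mean in row_means.items()}

for expr, mean in row_means.items():
    print(f"{expr:<13} mean over actors = {mean:5.1f}%  "
          f"Table 1 = {TABLE1_DIAGONAL[expr]}%  match = {consistent[expr]}")
```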
Slightly more surprising is the fact that “confusion” was
often mistaken for “disagreement”. Such a misidentification
in the interaction with an interface agent would also most
likely decrease efficiency (e.g., the user might choose to
defend his or her position, rather than clarify it as the system
expects).
In addition to reinforcing previous warnings about using
ambiguous facial expressions [18, 13], the pattern of confu-
sions clearly demonstrates that even the perfect duplication
of real expressions would not produce an unambiguous in-
terface agent. Duplication of a confusing template will only
lead to additional confusion.
A simpler explanation for the pattern of confusions in
Table 1 would be to claim that, since the “actors” were not
trained, they might not be producing the right expressions.
While this is a research topic in and of itself, it should be
kept in mind that humans often “pretend”, producing an ex-
pression that is appropriate to the given context regardless
of whether they really feel the proper emotion. Regardless,
Table 2, which depicts the success of the different actors at
producing correctly identifiable expressions, begins to dis-
entangle “bad acting” from real confusions. The first thing
that becomes apparent from a glance at this table is the wide
degree of variation in identification scores, both within and
across expressions. Sadness is a good example of the
former: Some actors were only correctly identified 40% of the
time, while others were correctly recognized 100% of the
time. Clearly some individuals were producing the wrong
(or at least ambiguous) expression, but one cannot say that
every actor was producing the wrong expression. Variation
across expressions is well exemplified by the “Thinking”
and “Agreement” expressions. All of the “Agreement”
expressions were identified 80% or more of the time, whereas
only one actor produced a “Thinking” expression that was
recognized more than 80% of the time. It seems, then, that
“Thinking” can be produced in a recognizable fashion, but
often is not. The interesting question here is what differs
in the image sequences that allows one to be well identi-
fied but the others not. Having identified instances of ex-
pressions that vary in terms of accuracy, future research can
begin to provide a more detailed description of what image
differences cause the perceptual differences.
While the addition of a conversational context, and the
concomitant expectations, would no doubt improve the
ability of participants to identify these expressions, Table 2
clearly shows that all of the expressions are potentially un-
ambiguous even without a context: Each expression was
recognized 100% of the time from at least one actor, with
the exception of “thinking” which was recognized 90% of
the time at best. Furthermore, having found the degree to
which these expressions can, by themselves, convey a given
meaning, one can begin to systematically examine exactly
how context can modulate that signal.
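The per-actor observations discussed above (the spread within an expression, the number of actors recognized more than 80% of the time, and the best case per expression) can be read off Table 2 programmatically. A minimal sketch using only the printed values:

```python
# Per-actor identification accuracy (Table 2), in percent.
TABLE2 = {
    "Agreement":    [100, 100, 100, 80, 100, 90],
    "Disagreement": [90, 80, 100, 100, 90, 50],
    "Happiness":    [80, 50, 100, 60, 60, 90],
    "Sadness":      [40, 80, 90, 80, 100, 100],
    "Thinking":     [60, 90, 80, 70, 60, 80],
    "Confusion":    [90, 70, 50, 40, 90, 100],
}

summary = {
    expr: {
        "min": min(scores),
        "max": max(scores),
        "above_80": sum(score > 80 for score in scores),  # actors above 80%
    }
    for expr, scores in TABLE2.items()
}

# Expressions that at least one actor produced with perfect recognition.
perfect_at_least_once = [e for e, s in summary.items() if s["max"] == 100]

for expr, stats in summary.items():
    print(f"{expr:<13} min={stats['min']:3d}%  max={stats['max']:3d}%  "
          f"actors above 80%: {stats['above_80']}")
```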
                   Response
Actual expression  Correct  False
Agreement             4.67    4.0
Disagreement          4.51   3.43
Happiness             4.44   4.23
Sadness               4.08   4.19
Thinking              4.30   3.96
Confusion             4.14   3.75

Table 3: Confidence ratings. The average confidence of
the participants in their responses is listed as a function
of whether they correctly identified the expression
or misidentified it. Confidence was rated on a 5 point
scale from completely unconfident (a value of 1) to
completely confident (a value of 5).

                   Response
Actual expression  Correct  False
Agreement             3.81   3.25
Disagreement          3.96   2.79
Happiness             3.54   3.28
Sadness               3.26   3.69
Thinking              3.91   3.33
Confusion             3.67   3.73

Table 4: Believability ratings. The average believability
ratings are listed as a function of whether the expression
was correctly or incorrectly identified. Believability
was judged on a 5 point scale from completely
unbelievable (1) to completely convincing (5).

Participants were generally quite confident in their
decisions (see Table 3). Although they were clearly less
confident when they made a mistake, they were still relatively
certain that they had correctly identified the expression.
That is, even when they made a mistake, people were rel-
atively certain that they were not making a mistake. This
confidence in one’s mistakes can have strong implications
for the design of conversational flow in general, and for the
design of an interface agent’s confusion handling and
persuasion routines in particular (see, e.g., [19, 20, 21]).
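The size of the confidence drop for misidentified expressions can be computed per expression from Table 3. The sketch below uses only the printed cell means; since the number of responses behind each mean is not reported, it compares cells rather than computing an overall average.

```python
# Mean confidence ratings from Table 3: (correct response, false response).
TABLE3 = {
    "Agreement":    (4.67, 4.0),
    "Disagreement": (4.51, 3.43),
    "Happiness":    (4.44, 4.23),
    "Sadness":      (4.08, 4.19),
    "Thinking":     (4.30, 3.96),
    "Confusion":    (4.14, 3.75),
}

SCALE_MIDPOINT = 3  # middle of the 1-to-5 rating scale

# Positive values mean lower confidence after a misidentification.
drop = {expr: round(correct - false, 2)
        for expr, (correct, false) in TABLE3.items()}
no_drop = [expr for expr, (correct, false) in TABLE3.items() if false >= correct]
lowest_false = min(false for _, false in TABLE3.values())

for expr, value in drop.items():
    print(f"{expr:<13} confidence drop when wrong: {value:+.2f}")
print("no drop for:", no_drop)
print("lowest mean confidence for a false response:", lowest_false)
```

In these cell means, every false-response rating stays above the scale midpoint, and sadness is the one expression for which the mean confidence was not lower after a misidentification.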
In general, the expressions were considered rather be-
lievable (see Table 4), but were not considered completely
convincing. The participants generally found the expressions to be
less believable when they had incorrectly identified them
(sadness and confusion were exceptions; see Table 4). That
is, if an expression was really one of “thinking”, but a
participant thought it was “confusion”, they would be relatively
certain that the expression was “confusion” (i.e., confidence
ratings), but would find the expression to be somewhat
unconvincing or contrived.
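The same per-expression comparison can be made for the believability ratings in Table 4. As before, the sketch relies only on the printed cell means and makes no claim about statistical significance.

```python
# Mean believability ratings from Table 4: (correct response, false response).
TABLE4 = {
    "Agreement":    (3.81, 3.25),
    "Disagreement": (3.96, 2.79),
    "Happiness":    (3.54, 3.28),
    "Sadness":      (3.26, 3.69),
    "Thinking":     (3.91, 3.33),
    "Confusion":    (3.67, 3.73),
}

# Expressions rated less believable when misidentified, and the exceptions.
less_believable_when_wrong = [e for e, (c, f) in TABLE4.items() if f < c]
exceptions = [e for e, (c, f) in TABLE4.items() if f >= c]
largest_drop = max(TABLE4, key=lambda e: TABLE4[e][0] - TABLE4[e][1])

print("less believable when misidentified:", less_believable_when_wrong)
print("exceptions:", exceptions)
print("largest drop:", largest_drop)
```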
5 Conclusion and Outlook
In general, these six conversational expressions are easy
to identify, even in the complete absence of conversational
context. There were, however, some noteworthy patterns
of confusion, most notably that “thinking” and “confusion”
were often misinterpreted. People were also generally con-
fident that they had correctly identified an expression, even
if they were, in fact, wrong. The specific pattern of con-
fusions and this confidence in the face of errors not only
have implications for the animation of conversational
expressions, but also for the design of an interface agent’s
conversational flow capabilities.
The asymmetry of the identification confusions (Table 1)
hints at a potential asymmetry in the underlying perceptual
space of facial expressions. This is important to know when
analyzing and synthesizing expressions, particularly when
dealing with variability within this expression space. Vari-
ability can arise for several reasons, including the presence
of sub-categories of expressions. For example, when
thinking, one can be pensive, contemplative, calculating, etc. (see,
e.g., Figure 3). All of these are recognizable as “thinking”,
but the distribution of them in expression space may not be
symmetrical, so one needs to be careful when traversing this
subregion. A second source of variability is the fact that
humans are largely incapable of exactly duplicating a behavior.
While the resulting minor variations in the motions accom-
panying an expression may or may not carry any commu-
nicative meaning, their absence may well be important (as
the mechanically perfect repetition would almost certainly
be recognized as unnatural).
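The asymmetry of the confusions mentioned above can be made explicit by comparing each off-diagonal cell of Table 1 with its mirror image. In the sketch below, the 10 percentage point threshold and the function name are arbitrary choices made for illustration.

```python
RESPONSES = ["Agreement", "Disagreement", "Happiness",
             "Sadness", "Thinking", "Confusion", "Other"]

# Table 1: rows are actual expressions, columns follow RESPONSES (percent).
TABLE1 = {
    "Agreement":    [95, 0, 2, 0, 0, 0, 3],
    "Disagreement": [0, 85, 2, 7, 0, 3, 3],
    "Happiness":    [7, 2, 73, 3, 0, 5, 10],
    "Sadness":      [0, 3, 0, 82, 5, 2, 8],
    "Thinking":     [0, 3, 0, 2, 73, 20, 2],
    "Confusion":    [0, 18, 0, 2, 5, 73, 2],
}


def asymmetric_pairs(table, threshold):
    """List pairs (a, b) where 'a seen as b' and 'b seen as a' differ strongly."""
    labels = list(table)
    pairs = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            a_as_b = table[a][RESPONSES.index(b)]
            b_as_a = table[b][RESPONSES.index(a)]
            if abs(a_as_b - b_as_a) >= threshold:
                pairs.append((a, b, a_as_b, b_as_a))
    return pairs


pairs = asymmetric_pairs(TABLE1, threshold=10)  # threshold is an ASSUMPTION
for a, b, a_as_b, b_as_a in pairs:
    print(f"{a} seen as {b}: {a_as_b}%   {b} seen as {a}: {b_as_a}%")
```

With this threshold, two pairs stand out: thinking is taken for confusion far more often than the reverse, and confusion is taken for disagreement far more often than the reverse.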
It is, of course, possible that some of the confusion arose
from the fact that the expressions were intentionally gen-
erated (i.e., were posed). There is considerable evidence,
however, that during normal conversation humans not only
intentionally generate various facial expressions, but do so
in synchrony with the auditory portion of a conversation [2].
That is, normal conversational expressions may be, at least
in part, just as “posed” as the specific words and phrases
used in a conversation. Moreover, people generally found
the present expressions to be believable, even when they
misidentified the expressions. That is, even when people
were confused about an expression, they still found the ex-
pression to be rather believable and not contrived. This is
a rather critical point, for two reasons. First, it serves to
reinforce the idea that real does not equate with believable.
The creation of a computerized 3D model of a head that
is a perfect physical duplicate of a real human head does
not automatically mean that the expressions generated with
such a head will be unambiguous or convincing. Second,
and perhaps more important, the simulation of proper
conversational behaviors must include socially determined
behaviors and expressions as well as any “truly genuine”
expression of emotion. In other words, regardless of the
underlying reason for why some individuals produced clearer
and more believable expressions than others, it is perhaps
more interesting to ask what portions of the image sequence
lead people to be confused, and which components enhance
proper recognition. Likewise, a detailed knowledge of the
components that enhance the perceived believability of the
expression would be of great help. To that end, the fact that
the present expressions were recorded from multiple view-
points with tracking markers placed on the faces allows us
to begin to manipulate the video sequences, and additional
experiments will help elucidate which components are
necessary and sufficient for unambiguous, believable
conversational expressions. Such knowledge would allow us to
answer in an informed fashion what we need to animate for
the end result not only to be well understood, but also be-
lieved.
6 Acknowledgments
This research was supported by the IST project
COMIC (COnversational Multi-modal Interaction
with Computers), IST-2002-32311. For more in-
formation about COMIC, please visit the web page
(http://www.hcrc.ed.ac.uk/comic/). We would like to thank
Dorothee Neukirchen for help in recording the sequences,
and Jan Peter de Ruiter and Adrian Schwaninger for fruitful
discussions.
References
[1] R. E. Bull and G. Connelly, “Body movement and emphasis in speech,” Journal of Nonverbal Behaviour, vol. 9, pp. 169 – 187, 1986.
[2] J. B. Bavelas and N. Chovil, “Visible acts of meaning - an integrated message model of language in face-to-face dialogue,” Journal of Language and Social Psychology, vol. 19, pp. 163 – 194, 2000.
[3] W. S. Condon and W. D. Ogston, “Sound film analysis of normal and pathological behaviour patterns,” Journal of Nervous and Mental Disease, vol. 143, pp. 338 – 347, 1966.
[4] M. T. Motley, “Facial affect and verbal context in conversation - facial expression as interjection,” Human Communication Research, vol. 20, pp. 3 – 40, 1993.
[5] D. DeCarlo, C. Revilla, and M. Stone, “Making discourse visible: Coding and animating conversational facial displays,” in Proceedings of the Computer Animation 2002, 2002, pp. 11 – 16.
[6] J. B. Bavelas, A. Black, C. R. Lemery, and J. Mullett, “I show how you feel - motor mimicry as a communicative act,” Journal of Personality and Social Psychology, vol. 59, pp. 322 – 329, 1986.
[7] P. Bull, “State of the art: Nonverbal communication,” The Psychologist, vol. 14, pp. 644 – 647, 2001.
[8] J. Cassell and K. R. Thorisson, “The power of a nod and a glance: Envelope vs. emotional feedback in animated conversational agents,” Applied Artificial Intelligence, vol. 13, pp. 519 – 538, 1999.
[9] J. Cassell, T. Bickmore, L. Campbell, H. Vilhjalmsson, and H. Yan, “More than just a pretty face: conversational protocols and the affordances of embodiment,” Knowledge-Based Systems, vol. 14, pp. 22 – 64, 2001.
[10] I. Poggi and C. Pelachaud, “Performative facial expressions in animated faces,” in Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. Cambridge, MA: MIT Press, 2000, pp. 115 – 188.
[11] J. B. Bavelas, L. Coates, and T. Johnson, “Listeners as co-narrators,” Journal of Personality and Social Psychology, vol. 79, pp. 941 – 952, 2000.
[12] V. H. Yngve, “On getting a word in edgewise,” in Papers from the Sixth Regional Meeting of the Chicago Linguistic Society. Chicago: Chicago Linguistic Society, 1970, pp. 567 – 578.
[13] M. Wilson, “Metaphor to personality: The role of animation in intelligent interface agents,” in Proceedings of the IJCAI-97 Workshop on Animated Interface Agents: Making them Intelligent, Nagoya, Japan, 1997.
[14] P. Ekman, “Universal and cultural differences in facial expressions of emotion,” in Nebraska Symposium on Motivation 1971, J. R. Cole, Ed. Lincoln, NE: University of Nebraska Press, 1972, pp. 207 – 283.
[15] M. Kleiner, C. Wallraven, and H. H. Bülthoff, “The MPI VideoLab,” Max-Planck-Institute for Biological Cybernetics, Tübingen, Germany, Tech. Rep. 104, 2003.
[16] M. G. Frank and J. Stennett, “The forced-choice paradigm and the perception of facial expressions of emotion,” Journal of Personality and Social Psychology, vol. 80, pp. 75 – 85, 2001.
[17] E. Krahmer, Z. Ruttkay, M. Swerts, and W. Wesselink, “Audiovisual cues to prominence,” in Proceedings ICSLP, 2002, pp. 1933 – 1936.
[18] D. M. Dehn and S. van Mulken, “The impact of animated interface agents: a review of empirical research,” International Journal of Human-Computer Studies, vol. 52, pp. 1 – 22, 2000.
[19] J. Jaccard, “Toward theories of persuasion and belief change,” Journal of Personality and Social Psychology, vol. 40, pp. 260 – 269, 1981.
[20] J. J. Jiang, G. Klein, and R. G. Vedder, “Persuasive expert systems: the influence of confidence and discrepancy,” Computers in Human Behavior, vol. 16, pp. 99 – 109, 2000.
[21] R. E. Petty, P. Briñol, and Z. L. Tormala, “Thought confidence as a determinant of persuasion: The self-validation hypothesis,” Journal of Personality and Social Psychology, vol. 85, pp. 722 – 741, 2002.
7
... Facial motion plays a complex and important role in communication . It can be used to modify the meaning of what is said[1] [2] [3] [4] [5], or to express meaning by itself[6] [7] [8]. Facial motion can also be used to help direct the flow of a conversation[9] [10] [11] [12] [13]. ...
... Overall, the expressions were easy to identify, despite the absence of a conversational context. This is consistent with previous work with conversational expressions[6] [7]. Recognition accuracy was surprisingly insensitive to changes in image size: Participants could recognize the expressions just as well at an image size of 64 by 48 pixels (where the face covered a mere 768 pixels) as at an image size of 512 by 384 pixels (where the face covered over 49,000 pixels ). ...
... Tables 1 through 6 show the confusion matrixes for the six image sizes. In general, the pattern of confusions at the larger image sizes is similar to those found in previous work with conversational expressions[6] [7]. Consistent with the patterns seen for the overall means (Figure 1 ...
Article
Facial expressions can be used to direct the flow of a conversation as well as to improve the clarity of communication. The critical physical differences between expressions can, however, be small and subtle. Clear presentation of facial expressions in applied settings, then, would seem to require a large conversational agent. Given that visual displays are generally limited in size, the usage of a large conversational agent would reduce the amount of space available for the display of other information. Here, we examine the role of image size in the recognition of facial expressions. The results show that conversational facial expressions can be easily recognized at surprisingly small image sizes. Copyright © 2004 John Wiley & Sons, Ltd.
... Some authors (e.g., Regenbrecht, 1999; Riecke, 2003, pp. 122-140) concentrate on the term of presence (i.e. the subjective sense of being there) as a to some degree transferable general criterion, while others rather prefer more task-specific or object-related criteria such as credibility, believability (Cunningham, Breidt, Kleiner, Wallraven, & Bülthoff, 2003), or just informational correspondence (Stappers, Gaver, & Overbeeke, 2003). As regards the scope of this dissertation one could establish affective equivalence as integrative quality and comparison criterion. ...
... Non-rigid movements of the human face, such as when we smile, talk or cry, have long been of interest to researchers studying the communicative and emotional aspects of social interaction (Cunningham et al. 2003; Rosenblum et al. 2002; Campbell et al. 1996; Bassili 1978; Kamachi et al. 2001). More recently, several groups have also begun to explore the role that characteristic facial motion might play in the assignment of individual identity (Hill and Johnston 2001; Knappmeyer et al. 2003; Lander et al. 1999; Thornton and Kourtzi 2002; see O'Toole et al. 2002 for a review). ...
Article
Recently there has been growing interest in the role that motion might play in the perception and representation of facial identity. Most studies have considered old/new recognition as a task. However, especially for non-rigid motion, these studies have often produced contradictory results. Here, we used a delayed visual search paradigm to explore how learning is affected by non-rigid facial motion. In the current studies we trained observers on two frontal view faces, one moving non-rigidly, the other a static picture. After a delay, observers were asked to identify the targets in static search arrays containing 2, 4 or 6 faces. On a given trial target and distractor faces could be shown in one of five viewpoints, frontal, 22° or 45° to the left or right. We found that familiarizing observers with dynamic faces led to a constant reaction time advantage across all set sizes and viewpoints compared to static familiarization. This suggests that non-rigid motion affects identity decisions even across extended periods of time and changes in viewpoint. Furthermore, it seems as if such effects may be difficult to observe using more traditional old/new recognition tasks.
... Up to now, video recordings in the VideoLab were done on human facial expressions, head gestures and simple actions. While some of the video material was used "as is" for presentation to human subjects as part of experiments on face and action perception (see e.g., [16] and [17]), some experiments required the application of computer vision and computer graphics algorithms to the video footage as a postprocessing step. Typical examples are the removal of head movements from video footage, video based manipulation of facial expressions and video based manipulation of parts of the face (see e.g., [18] and [19]). ...
Article
The MPI VideoLab is a custom-built, flexible digital video and audio recording studio that enables high-quality, time-synchronized recordings of human actions from multiple viewpoints. This technical report describes the requirements for the system in the context of our applications, its hardware and software equipment, and the special features of the recording setup. Important aspects of the hardware and software implementation are discussed in detail.
... 22.4). Moreover, there are a number of different ways that humans express any given meaning, and not all of the resulting expressions are easily recognized [27] [28] [29]. All these factors jointly make the recognition of facial expressions one of the most difficult tasks the human visual system can perform [7]. ...
Chapter
In this chapter, we will focus on the role of motion in identity and expression recognition in human, and its developmental and neurophysiological aspects. Based on results from literature, we make it clear that there is some form of characteristic facial information that is only available over time, and that it plays an important role in the recognition of identity, expression, speech, and gender; and that the addition of dynamic information improves the recognizability of expressions and identity, and can compensate for the loss of static information. Moreover, at least several different types of motion seem to exist, they play different roles, and a simple rigid/nonrigid dichotomy is neither sufficient nor appropriate to describe these motions. Additional research is necessary to determine what the dynamic features for face processing are.
Article
In this article, we present the spatial logistics task (SLOT) platform for investigating multimodal communication between 2 human participants. Presented are the SLOT communication task and the software and hardware that has been developed to run SLOT experiments and record the participants' multimodal behavior. SLOT offers a high level of flexibility in varying the context of the communication and is particularly useful in studies of the relationship between pen gestures and speech. We illustrate the use of the SLOT platform by discussing the results of some early experiments. The first is an experiment on negotiation with a one-way mirror between the participants, and the second is an exploratory study of automatic recognition of spontaneous pen gestures. The results of these studies demonstrate the usefulness of the SLOT platform for conducting multimodal communication research in both human-human and human-computer interactions.
Conference Paper
INTRODUCTION: When we move through the environment, the self-to-surround relations constantly change. Nevertheless, we perceive the world as stable. A process that is critical to this perceived stability is "spatial updating", which automatically updates ...
Conference Paper
The human face is capable of producing an astonishing variety of expressions---expressions for which sometimes the smallest difference changes the perceived meaning noticeably. Producing realistic-looking facial animations that are able to transport this degree of complexity continues to be a challenging research topic in computer graphics. One important question that remains to be answered is: When are facial animations good enough? Here we present an integrated framework in which psychophysical experiments are used in a first step to systematically evaluate the perceptual quality of computer-generated animations with respect to real-world video sequences. The result of the first experiment is an evaluation of several animation techniques in which we expose specific animation parameters that are important for perceptual fidelity. In a second experiment we then use these benchmarked animations in the context of perceptual research in order to systematically investigate the spatio-temporal characteristics of expressions. Using such an integrated approach, we are able to provide insights into facial expressions for both the perceptual and computer graphics community.
Conference Paper
Stylized rendering aims to abstract information in an image making it useful not only for artistic but also for visualization purposes. Recent advances in computer graphics techniques have made it possible to render many varieties of stylized imagery efficiently. So far, however, few attempts have been made to characterize the perceptual impact and effectiveness of stylization. In this paper, we report several experiments that evaluate three different stylization techniques in the context of dynamic facial expressions. Going beyond the usual questionnaire approach, the experiments compare the techniques according to several criteria ranging from introspective measures (subjective preference) to task-dependent measures (recognizability, intensity). Our results shed light on how stylization of image contents affects the perception and subjective evaluation of facial expressions.
Conference Paper
Recent work has shown the potential of basic perceptual properties of motion for notification, association and visual search. Yet evidence from fields as diverse as perceptual science, social psychology and the performing arts suggests that motion has much richer communication potential in its interpretative scope. A long history of research and practice in the affective properties of motion has resulted in a bewildering plethora of potentially rich communicative attributes. What remains to be established is how and whether these perceptual effects and impressions can be computationally manipulated in a display environment as variables of affective communication. In this paper we explore attributes of expressive motion and report initial results from a study in which we explored which attributes might be most important in distinguishing motions meant to convey emotion.
Article
Videotapes were made of opposite-sex pairs of students in conversation with one another who had been asked to discuss items from an attitude questionnaire on which they had disagreed. At the end of the conversation, the subjects were asked to replay the videotape and to indicate which body movements of themselves and their partners they considered conveyed emphasis; these movements were then categorized using a detailed body movement system. In a second procedure, the occurrence of vocal stress was scored, together with the associated body movements. The results showed that whereas a wide diversity of primarily hand/arm movements were selected by the subjects as communicating emphasis, it was movements of all parts of the body which were related to vocal stress. It was concluded that a close relationship does exist between body movement and tonic stress, and that this can only be effectively appreciated through a body movement scoring system which enables a detailed description to be given of the visual and temporal relationship between body movement and phonemic clause structure.
Article
The authors propose that dialogue in face-to-face interaction is both audible and visible; language use in this setting includes visible acts of meaning such as facial displays and hand gestures. Several criteria distinguish these from other nonverbal acts: (a) They are sensitive to a sender-receiver relationship in that they are less likely to occur when an addressee will not see them, (b) they are analogically encoded symbols, (c) their meaning can be explicated or demonstrated in context, and (d) they are fully integrated with the accompanying words, although they may be redundant or nonredundant with these words. For these particular acts, the authors eschew the term nonverbal communication because it is a negative definition based solely on physical source. Instead, they propose an integrated message model in which the moment-by-moment audible and visible communicative acts are treated as a unified whole.
Article
X.1 Introduction In face-to-face interaction, multimodal signals are at work. We communicate not only through words, but also by intonation, body posture, hand gestures, gaze patterns, facial expressions, and so on. All these signals, verbal and nonverbal, do have a role in the communicative process. They add/modify/substitute information in discourse and are highly linked with one another. This is why facial and bodily animation is becoming relevant in the construction of believable synthetic agents. In building autonomous agents with talking faces, agents capable of expressive and communicative behavior, we consider it important that the agent express his communicative intentions. Suppose an agent has the goal of communicating something to some particular interlocutor in a particular situation and context: he has to decide which words to utter, which intonation to use, and which facial expression to display. In this work, we restrict ourselves only to the visual display of communicative intentions, leaving aside the auditory ones. We focus on facial expressions and propose a meaning-to-face approach, aiming at
Article
Elementary motor mimicry (e.g., wincing when another is injured) has been previously considered in social psychology as the overt manifestation of some intrapersonal process such as vicarious emotion. A 2-part experiment with 50 university students tested the hypothesis that motor mimicry is instead an interpersonal event, a nonverbal communication intended to be seen by the other. Part 1 examined the effect of a receiver on the observer's motor mimicry. The victim of an apparently painful injury was either increasingly or decreasingly available for eye contact with the observer. Microanalysis showed that the pattern and timing of the observer's motor mimicry were significantly affected by the visual availability of the victim. In Part 2, naive decoders viewed and rated the reactions of these observers. Their ratings confirmed that motor mimicry was consistently decoded as "knowing" and "caring" and that these interpretations were significantly related to the experimental condition under which the reactions were elicited. Results cannot be explained by any alternative intrapersonal theory, so a parallel process model is proposed in which the eliciting stimulus may set off both internal reactions and communicative responses, and it is the communicative situation that determines the visible behavior. (37 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Reviews research trends on the effects of persuasive messages on attitude change. It is concluded that new orientations to this area are required, and 2 general directions for theories of persuasion are proposed: (a) a theory of the acceptance of assertions and (b) a theory of the acceptance of complex messages. 270 female college students participated in an empirical investigation on the former, which supported the viability of the proposed approach. (21 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
This study examines affective facial expression in conversation. Experiment 1 demonstrates that the accuracy of affect-identification for conversational facial expressions generally is no better than chance. The explanation explored by Experiment 2 is that many conversational facial expressions operate as nonverbal interjections. Thus, much like verbal interjections (“gosh,” “really,” “oh please,” “jeez,” etc.), the attribution of affect for certain conversational facial expressions should depend on their verbal context. Experiment 2 supports the notion of facial expression as interjection by demonstrating that most any conversational facial expression, regardless of its true source emotion or of the affect it signals in isolation, tends to be interpreted according to the affect associated with the verbal context in which it occurs. In addition to the identification of context-dependent interjection as yet another function of facial expression, the study suggests a pressing need for further investigation of nonverbal behavior in natural-conversation settings.