A webGL talking head for mobile devices
Alberto Benin
ISTC-CNR
Via Martiri della libertà 2
35137 Padova, Italy
0039 498271819
alberto.benin@pd.istc.cnr.it
Piero Cosi
ISTC-CNR
Via Martiri della libertà 2
35137 Padova, Italy
0039 498271822
piero.cosi@pd.istc.cnr.it
G.Riccardo Leone
ISTC-CNR
Via Martiri della libertà 2
35137 Padova, Italy
0039 498271822
riccardo.leone@pd.istc.cnr.it
ABSTRACT
Luciaweb is a 3D Italian talking avatar based on the new WebGL technology. WebGL is the standard programming library for developing 3D computer graphics inside web browsers. Over the last year we developed a facial animation system based on this library to interact with the user in a bimodal way. The overall system is a client-server application using the HTTP protocol: there is a client (a browser or an app) and a web server. No software download and no plugin are required. All the software resides on the server, and the visualization player is delivered inside the HTML pages that the client requests at the beginning of the connection. On the server side, software called the Audio Video Engine generates the phoneme and viseme information needed for the animation. The demo, called Emotional Parrot, shows the ability to reproduce the same input in different emotional states. This is the first WebGL software ever to run on an iOS device.
Categories and Subject Descriptors
I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism – animation, color, shading, shadowing, and texture; virtual reality.
General Terms
Design, Experimentation, Human Factors, Languages
Keywords
WebGL, talking head, MPEG-4, mobile, iOS, speech synthesis
1. INTRODUCTION
Face-to-face communication is the main element of human-human interaction because the acoustic and the visual signal simultaneously convey linguistic, extra-linguistic and paralinguistic information. Facial animation has therefore been a research topic since the early '70s, and many different principles, models and animations have been proposed over the years [1, 2]. An efficient coding of the shape and animation of the human face was included in the MPEG-4 international standard [3]. At ISTC-CNR of Padua we developed the LUCIA talking head, an open source facial animation framework [4]. With the introduction of WebGL [5], which brings 3D graphics to web browsers, we made it possible to embed LUCIA in any internet site [6]. Now it is time for mobile devices: as far as we know, this is the first native WebGL application in the world running on iOS mobile devices.
Figure 1: Luciaweb on an iOS device: text input in Parrot Mode.
2. SYSTEM ARCHITECTURE
Luciaweb follows the common client-server paradigm. First, the client (a web browser or a mobile device application) opens a connection to the server; the answer is an HTML5 web page which delivers the multimedia content needed to start the MPEG-4 player.
2.1 The webGL client
The typical WebGL application is composed of three parts: the standard HTML code, the main JavaScript program and a new shading-language section. The HTML section is intended mainly for user interaction. The JavaScript part is the core of the application: the graphics library itself and all the matrix manipulation, support and utility functions live here; input from the user is connected to JavaScript variables via ad-hoc event-driven procedures. The novelty is the third part, the shading-language code, which runs on the video card. The language is called GLSL and derives from the C programming language. These are the instructions that calculate the color value of every pixel on the screen whenever the drawing function is called in the main JavaScript program. To let JavaScript change the values of GLSL variables, the WebGL Application Programming Interface implements special methods that connect them with JavaScript objects, arrays and variables. During the initialization of the WebGL page the shader code is compiled and copied to the video card memory, ready to be executed on the Graphics Processing Unit. At the beginning of the connection the model parts are fetched using the lightweight data-interchange format JSON [7]. This is the only moment where the user may have to wait, because of the amount of data to be transmitted; right after this phase all facial movements run almost in real time.
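To make the interplay between the three layers concrete, the following minimal sketch compiles and links a GLSL program, connects uniform variables to JavaScript values, and fetches model geometry as JSON. It is illustrative only: the shader pair, the names (uMvpMatrix, uColor, aPosition), the model URL and the use of the modern fetch API are assumptions for the example, not the actual Luciaweb code.

```javascript
// Minimal WebGL bootstrap sketch (illustrative, not the Luciaweb source).
const gl = document.getElementById('canvas').getContext('webgl');

// GLSL code that will run on the video card: a trivial vertex/fragment pair.
const vsSource = `
  attribute vec3 aPosition;
  uniform mat4 uMvpMatrix;          // model-view-projection matrix set from JS
  void main() { gl_Position = uMvpMatrix * vec4(aPosition, 1.0); }`;
const fsSource = `
  precision mediump float;
  uniform vec4 uColor;              // pixel color chosen from JS
  void main() { gl_FragColor = uColor; }`;

// Shaders are compiled once and copied to video card memory.
function compile(type, src) {
  const s = gl.createShader(type);
  gl.shaderSource(s, src);
  gl.compileShader(s);
  if (!gl.getShaderParameter(s, gl.COMPILE_STATUS))
    throw new Error(gl.getShaderInfoLog(s));
  return s;
}
const prog = gl.createProgram();
gl.attachShader(prog, compile(gl.VERTEX_SHADER, vsSource));
gl.attachShader(prog, compile(gl.FRAGMENT_SHADER, fsSource));
gl.linkProgram(prog);
gl.useProgram(prog);

// The special WebGL methods that connect GLSL variables to JavaScript values:
const uColor = gl.getUniformLocation(prog, 'uColor');
gl.uniform4f(uColor, 1.0, 0.8, 0.7, 1.0);
const uMvp = gl.getUniformLocation(prog, 'uMvpMatrix');
gl.uniformMatrix4fv(uMvp, false, new Float32Array(
  [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1]));  // identity

// Model parts are fetched once as JSON at connection time; this is the
// only phase where the user may have to wait (hypothetical URL).
fetch('models/lucia_mesh.json')
  .then(r => r.json())
  .then(mesh => {
    const vbo = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
    gl.bufferData(gl.ARRAY_BUFFER,
                  new Float32Array(mesh.vertices), gl.STATIC_DRAW);
    const aPosition = gl.getAttribLocation(prog, 'aPosition');
    gl.enableVertexAttribArray(aPosition);
    gl.vertexAttribPointer(aPosition, 3, gl.FLOAT, false, 0, 0);
    gl.drawArrays(gl.TRIANGLES, 0, mesh.vertices.length / 3);
  });
```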
2.2 The Audio Video Engine Server
Audio-visual speech synthesis, that is, the automatic generation of voice and facial animation parameters from arbitrary text, is based on parametric descriptions of both the acoustic and the visual speech modalities. The acoustic speech synthesis uses an Italian version of the FESTIVAL diphone TTS synthesizer [8] modified with emotive/expressive capabilities: the APML/VSML markup language [9] for behavior specification makes it possible to mark up the verbal part of a dialog so as to modify the graphical and speech parameters that an animated agent needs to produce the required expressions. For the visual speech synthesis a data-driven procedure was used: visual data are physically extracted by ELITE [10], an automatic opto-tracking movement analyzer for the acquisition of 3D kinematic data. The 3D coordinates of reflecting markers positioned on the actor's face are recorded and collected, together with their velocities and accelerations, simultaneously with the co-produced speech. Using PRAAT [11], we obtain parameters that are quite significant in characterizing emotive/expressive speech [12]. To simplify and automate many of the operations needed to build the 3D avatar from the motion-captured data, we developed INTERFACE [13], an integrated software tool designed and implemented in Matlab. To reproduce realistic facial animation in the presence of co-articulation, a modified version of the Cohen-Massaro co-articulation model [14] has been adopted for LUCIA [15].
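The intuition behind dominance-based co-articulation can be sketched in a few lines. The following is a minimal illustration in the spirit of the classic Cohen-Massaro formulation [14], where each viseme exerts an exponentially decaying dominance around its own time centre and the animated value of a facial parameter is the dominance-weighted average of the viseme targets; the constants and data layout are invented for the example and do not reproduce LUCIA's trained, modified model.

```javascript
// Dominance-function blending sketch (illustrative values, not trained data).
// target: value the viseme requests for this facial parameter
// centre: time (s) of the viseme's peak influence
// alpha : dominance magnitude; theta: decay rate; c: decay shape
const visemes = [
  { target: 0.2, centre: 0.10, alpha: 1.0, theta: 9.0, c: 1.0 },
  { target: 0.9, centre: 0.25, alpha: 0.8, theta: 7.0, c: 1.0 },
  { target: 0.4, centre: 0.42, alpha: 1.0, theta: 8.0, c: 1.0 },
];

// Exponential dominance, falling off on both sides of the viseme centre.
function dominance(v, t) {
  return v.alpha * Math.exp(-v.theta * Math.pow(Math.abs(t - v.centre), v.c));
}

// Blended parameter value at time t: weighted average of all viseme targets.
function parameterAt(t) {
  let num = 0, den = 0;
  for (const v of visemes) {
    const d = dominance(v, t);
    num += d * v.target;
    den += d;
  }
  return den > 0 ? num / den : 0;
}

// Sample the trajectory at 25 fps, e.g. to drive an MPEG-4 FAP stream.
for (let t = 0; t <= 0.5; t += 1 / 25) {
  console.log(t.toFixed(2), parameterAt(t).toFixed(3));
}
```

Because neighbouring dominance curves overlap, each viseme's target pulls the trajectory before and after its own time centre, which is exactly the co-articulation effect the model is meant to capture.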
3. THE EMOTIONAL PARROT DEMO
To see the emotional synthesis capabilities of our avatar you can use LUCIA in "Parrot Mode": she repeats any input text you enter, and she can do so in six different emotional ways: joy, surprise, fear, anger, sadness and disgust. For this classification we took inspiration from [16]. The demo is perfectly fluid on a normal computer (about 60 fps), while it suffers from some frame skipping on computationally weaker hardware (7 fps on an iPad 2).
4. CONCLUSION
Luciaweb is an MPEG-4 standard, FAP-driven Italian talking head. It is a decoder compatible with the "Predictable Facial Animation Object Profile". It animates a high-quality 3D model with a fine co-articulation model that is automatically trained on real data. It runs on any WebGL-compatible browser and now, first in the world, on iOS mobile devices (iPhone and iPad). It reproduces the input text in six different emotional states in "Parrot Mode".
5. FUTURE WORK
In the near future we will integrate speech recognition to obtain a double input channel, and we will test the performance of the Mary TTS synthesis engine [17] for the Italian language. We will also work on dynamic mesh reduction to improve the frame rate on computationally weaker hardware.
6. REFERENCES
[1] Parke, F.I. and Waters, K. 1996. Computer facial animation. Natick, MA, USA: A. K. Peters, Ltd.
[2] Ekman, P. and Friesen, W. 1978. Facial action coding system. Consulting Psychologists Press.
[3] Pandzic, I.S. and Forchheimer, R., Eds. 2003. MPEG-4 Facial Animation: The Standard, Implementation and Applications. John Wiley & Sons, Inc.
[4] Leone, G.R., Paci, G. and Cosi, P. 2011. LUCIA: An open source 3D expressive avatar for multimodal h.c.i. In proc. of INTETAIN 2011.
[5] WebGL, http://www.khronos.org/webgl/.
[6] Leone, G.R. and Cosi, P. 2011. Lucia-WebGL: a web based Italian MPEG-4 Talking Head. In proc. of AVSP 2011.
[7] JSON, http://www.json.org/.
[8] Cosi, P., Tesser, F., Gretter, R. and Avesani, C. 2001. Festival speaks Italian! In proc. of Eurospeech 2001.
[9] Carolis, B.D., Pelachaud, C., Poggi, I. and Steedman, M.,
2004. Apml, a mark-up language for believable behavior
generation. Life-Like Characters, 65–85.
[10] Ferrigno, G. and Pedotti, A. 1985. ELITE: A digital dedicated hardware system for movement analysis via real-time TV signal processing. IEEE Trans. on Biomedical Eng.
[11] Boersma, P. 1996. A system for doing phonetics by computer. Glot International, 341–345.
[12] Drioli, C., Cosi, P., Tesser, F. and Tisato, G. 2003. Emotions and voice quality: Experiments with sinusoidal modeling. In proc. of Voqual 2003. Geneva, 127–132.
[13] Tisato, G., Drioli, C., Cosi, P. and Tesser, F. 2005. INTERFACE: a new tool for building emotive/expressive talking heads. In proc. of INTERSPEECH 2005.
[14] Cosi, P. and Perin, G. 2002. Labial co-articulation modeling
for realistic facial animation. In proc. of ICMI 2002.
[15] Cosi, P., Fusaro, A. and Tisato, G. 2003. LUCIA: a new Italian talking head based on a modified Cohen-Massaro labial co-articulation model. In proc. of Eurospeech 2003.
[16] Ruttkay, Z., Noot, H. and Hagen, P. 2003. Emotion Disc and
Emotion Squares: tools to explore the facial expression
space. Computer Graphics Forum, 22(1), 49-53.
[17] Schröder, M. and Trouvain, J. 2003. The German text-to-speech synthesis system MARY: A tool for research, development and teaching. Int. J. of Speech Technology.