Conference Paper
Voice Driven Type Design
Matthias Wölfel
School of Digital Media
Furtwangen University
Furtwangen, Germany
Tim Schlippe
Research Karlsruhe
Karlsruhe, Germany
Angelo Stitz
School of Design
Pforzheim University
Pforzheim, Germany
Abstract–With voice driven type design (VDTD), we introduce a novel concept to present written information in the digital age. While the shape of a single typographical character has been treated as an unchangeable property until today, we present an innovative method to adjust the shape of each single character according to particular acoustic features in the spoken reference. Thereby, we keep some individuality and gain additional value in written text, which offers different applications – providing meta-information in subtitles and chats, supporting deaf and hearing impaired people, illustrating intonation and accentuation in books for language learners, giving hints how to sing – up to artistic expression. By conducting a user study, we have demonstrated that – using our proposed approach – loudness, pitch and speed can be represented visually by changing the shape of each character. By complementing homogeneous type design with these parameters, the original intention and characteristics of the speaker (personal expression and intonation) are better supported.

Keywords–type design, typography, responsive type, speech analysis, speech representation, adaptive character shape.
I. INTRODUCTION

Cultural evolution began with personal communication, first in oral and later also in written form. Both forms, oral and handwritten, do not only include the transfer of pure information, but are also a form of personal expression.
This, however, changed with the invention of movable components (usually individual letters and punctuation marks) to reproduce the elements of a document. The world’s first known movable type system was created in China around 1040 by Bi Sheng [2]. But only with the introduction of the movable-type printing system in Europe by Johannes Gutenberg around the 1450s [3] could movable type demonstrate its superiority. In contrast to the thousands of characters needed in the Chinese writing system, European languages need far fewer characters, which makes movable type much easier to handle. After their invention in
the 1860s, typewriters became a convenient tool for practically
all written communication and quickly replaced handwriting
except for personal correspondence [4]. While industrialization
necessitated standardization of type in that replication process
(Footnote: See, for example, the controversially discussed field of graphology, which analyzes the physical characteristics and patterns of handwriting to identify the writer, indicate the psychological state at the time of writing, or evaluate personality characteristics [1].)

(Footnote: In contrast to keyboards for text input, musical keyboards offer a range of expression types including velocity sensitivity (how fast a key is pressed), pressure sensitivity (amount of force on a held-down key) and displacement sensitivity (distance that a key is pressed down).)
and their materials [4], digitization offers a liberation of these
stringent formats: fonts have been developed in all kinds of flavors. But keys have remained the primary input modality since the invention of the typewriter. Even the invention of the word processor in the late 1960s did not change this [4]. Due to this restriction of keys, which provide only two states and time information, we are not able to express individual characteristics. These individual characteristics, intonation as well as the emotional state, can then only be ‘reconstructed’ either by explicit reference (e.g. “I’m happy!”) or by other verbal (e.g. linguistic behavior such as changes in disagreement, affect terms, and verbosity) or nonverbal cues (e.g. use of punctuation or emoticons [5]) [6]. Speech as an alternative form of input
modality, in contrast to a keyboard, contains more information. This information is complementary to text and reflects individuality and emotion. However, it is simply thrown away: it is neither used to influence the type design nor represented in any other graphical form. We got so used to this generic transfer of information that nobody challenges this way of representing it. What gets lost becomes more obvious, for instance, if we transform written text back to speech with ‘simple’ speech synthesis. Without emotion, prosody and personalization, listening to such speech gets very boring very soon.
In order to keep the additional information present in verbal
communication also in text-based communication, we introduce
a novel concept: In voice driven type design (VDTD) the shape
of a single character adjusts according to particular acoustic
features in the spoken reference, such as loudness, pitch, and speed.
II. RELATED WORK

Using typography as a stylistic device has a very long tradition which comes in different flavors. It can be:

• static, as presented in printed books, posters and comics, where type represents content in a uniform and permanent way [7],
(Footnote: With ‘simple’ we refer to speech synthesis which does not use linguistic analysis to estimate intonation and duration.)
• dynamic, as in kinetic typography, an animation technique to express ideas using text-based video animation [8, 9], or

• reactive, as in responsive type [10], where the shape of each letter is adjusted according to properties (such as age, eye sight, relative position, and speed) of the reader.
While the previously mentioned approaches, in general, use a uniform character style, sound poetry ignores these constraints. Sound poetry is probably best described as an artistic form of performing a written text, focusing on the phonetic aspects of speech besides its semantic values [11]. For instance, each line of the poem Karawane by Hugo Ball [12], one of the pioneers of Dadaist poetry, reflects “through different typographic styles the acoustic dimension of a linguistic sign” [13].
Another approach, instead of manipulating the look of the text itself, is to use emoticons; these, however, express emotions rather than an acoustic dimension. Even though emoticons have been
demonstrated to be effective for remote emotional commu-
nication [14], additional characters have to be added and
variation within text cannot be represented well. Another
drawback of this approach is that not all emoticons are
interpreted equally between different cultures [15].
Automatic processing of speech is a research topic with a
rather long tradition. However, speech recognition as well as emotion recognition based on acoustic features have been, and still are, treated as independent tasks. On one hand, automatic speech
recognition systems are still limited to recognizing what has
been said without being concerned about how. On the other
hand, emotion recognition systems aim to classify the type of
emotion that lies within the acoustic speech signal. How those
emotions could be expressed in text-based communication has
not been widely investigated [16].
Using acoustic cues to determine the look of text or of additional hints such as emoticons has, to the best of our knowledge, only recently begun to be investigated: Zimmermann
[17] proposed to select pre-existing emoticons from multimedia
data (including video, still image, and/or audio) captured by a
device. Furthermore, he also proposed to generate emoticons
based on the expressions on the user’s face. Matsumiya et al.
[16] have proposed and investigated how to automatically
generate the shape and appearance of text balloons based on
linguistic and acoustic speech features. Given a chance rate of
73% to estimate the shape of text balloons on their comics-anime
test corpus, they reached 87% accuracy and demonstrated that
subtitling with text balloons is better than that with static text.
III. VOICE DRIVEN TYPE DESIGN

As described in the previous section, related work has been limited to a sentence-based selection of appropriate text fragments corresponding to spoken utterances. In contrast, the focus of our work is to present and investigate a granularity with a much higher resolution, namely each single phoneme and its corresponding grapheme or graphemes: We
propose to vary the shape of a single character according to particular acoustic features in the spoken reference. (Footnote: The transcription can either be given a priori or automatically generated with automatic speech recognition.) Our motivation for a grapheme-level adaptation of the transcription is to better represent the characteristics of the spoken utterance and
to keep individuality in written text. This strategy allows providing additional meta-information in subtitles and chats, supporting hearing impaired and deaf people, illustrating intonation and accentuation in books for language learners, and giving hints how to sing. These applications would be more limited or not possible at all with previous approaches.
Fig. 1. From speech signal to voice driven type designed text.
A. Algorithm
As illustrated in Figure 1, retrieving VDTD, given a spoken utterance and its transcription, consists of the following steps:
1) Phoneme alignment and acoustic feature extraction
Generate phoneme transcription and determine the
beginning and end of each phoneme.
Determine loudness and pitch (step size 10 ms)
given the acoustic signal.
2) Phoneme-to-grapheme alignment
Align each phoneme or phoneme sequence to one
or more graphemes.
3) Features-grapheme alignment
Determine the speed parameter of each grapheme
by using the beginning and end times of the
corresponding phoneme or phoneme sequence.
Determine the loudness and pitch parameters of the
grapheme by averaging loudness and pitch
according to the beginning and end times of the
corresponding phoneme or phoneme sequence.
4) Type design
Generate the shape of each character according to
the corresponding normalized (mean and variance)
and mapped features loudness, pitch and speed.
To retrieve the phoneme transcription and to align the audio
sequence to the word sequence, we used the Munich Automatic
Segmentation System MAUS [18]. The acoustic features were
extracted using our own code which analyzes weighted Fourier
spectra to get loudness and cross-correlation to determine pitch.
To align the phoneme sequence to the grapheme sequence, we
applied the m2m-aligner [19].
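The feature-to-grapheme part of the pipeline (steps 3 and 4 above) can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the function names, alignment tuples and toy values are assumptions, and the real system obtains its alignments from MAUS and the m2m-aligner.

```python
# Illustrative sketch of steps 3 and 4: averaging the 10 ms frame features
# over each phoneme's time span and handing the result to the aligned
# graphemes. Names and toy values are assumptions, not the authors' code.

def average_features(frames, start, end, step=0.01):
    """Average per-frame values (10 ms step) over [start, end) seconds."""
    i0 = round(start / step)
    i1 = max(round(end / step), i0 + 1)
    window = frames[i0:i1]
    return sum(window) / len(window)

def features_per_grapheme(loudness, pitch, alignment):
    """alignment: list of (graphemes, start_s, end_s), one entry per phoneme."""
    result = []
    for graphemes, start, end in alignment:
        feats = {
            "loudness": average_features(loudness, start, end),
            "pitch": average_features(pitch, start, end),
            # 'speed' is modeled here simply as graphemes per second
            "speed": len(graphemes) / (end - start),
        }
        for g in graphemes:          # each grapheme inherits the features
            result.append((g, feats))
    return result

# Toy example: "shy" -> /ʃ aɪ/ aligned to the grapheme groups "sh" and "y"
loudness = [0.2] * 20 + [0.8] * 30   # 50 frames of 10 ms = 0.5 s
pitch = [0.4] * 20 + [0.6] * 30
alignment = [("sh", 0.0, 0.2), ("y", 0.2, 0.5)]
per_grapheme = features_per_grapheme(loudness, pitch, alignment)
```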
1) Normalizing the speech parameters
The mean and variance values of the acoustic features vary
between the phonemes and phoneme classes. For example, a
vowel is significantly louder than a fricative and has a wider
range in loudness. Since the goal of our proposed approach is to
visualize acoustic variation and not to project such differing
ranges across phonemes and phoneme classes, we need to
compensate for the different means and variances per phoneme
before applying the features. Otherwise, the generated typographical character sequence would consist of unevenly distributed characteristics per phoneme class: all vowels would, for instance, always be displayed with a wider stroke than fricatives.
By normalizing each phoneme class c according to

p'_c = 0.5 + 0.25 · (p_c − μ_c) / σ_c ,

where p denotes the acoustic parameter, μ the mean value and σ the standard deviation, a homogeneous design can be provided in which only the variation of each phone within its class is pronounced. The mean and standard deviation are calculated from a training set consisting of various utterances by different speakers. To guarantee that the resulting parameters lie in the range between 0 and 1, values outside this range are clipped. Then the
normalized acoustic parameters (loudness, pitch and speed) of a
phoneme are handed over to the corresponding grapheme or
grapheme sequence to form the graphical parameters (vertical
stroke weight, horizontal stroke weight, and character width).
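A minimal sketch of this per-class normalization; the class statistics below are invented for illustration, whereas the real μ and σ come from a training set of several speakers:

```python
# Minimal sketch of the per-phoneme-class normalization described above.
# The class statistics are invented for illustration only.

def normalize(value, mu, sigma):
    """Map a raw feature to 0.5 +/- 0.25 standard deviations, clipped to [0, 1]."""
    p = 0.5 + 0.25 * (value - mu) / sigma
    return min(1.0, max(0.0, p))

# Invented class statistics: vowels are louder and vary more than fricatives.
stats = {
    "vowel":     {"mu": 0.70, "sigma": 0.15},
    "fricative": {"mu": 0.30, "sigma": 0.05},
}

# An average-loudness vowel and an average-loudness fricative both map
# to 0.5, so only within-class variation remains visible.
v = normalize(0.70, **stats["vowel"])
f = normalize(0.30, **stats["fricative"])
```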
2) Correspondence between phonemes and graphemes
The relationship between graphemes and phonemes varies
among languages [20, 21]. Languages with alphabetic scripts are
characterized by four types of relationships between phonemes
and graphemes: a 1-to-1, a p-to-1, a 1-to-g and a
p-to-g mapping, where p and g are integer values greater than 1.
Therefore, in the simplest case, one phoneme represents one
grapheme. In all other cases a 1-to-1 relationship is not given.
In case of a close grapheme-to-phoneme relationship, such
as German, mostly one character represents one phoneme.
Consequently, each character can be adapted based on the
acoustic characteristics of the corresponding phoneme. To deal
with all kind of phoneme-to-grapheme relationships, we apply
the following strategy which is also demonstrated in Figure 2:
1) Perform an automatic forced alignment process which aligns one phoneme to between one and three corresponding graphemes [22].
2) Transfer the characteristics of the phoneme to the
corresponding grapheme.
For German and English, a mapping of a phoneme to up to
three characters has proved to be successful (e.g. for English igh
- /aɪ/ and sh - /ʃ/). Diphthongs (e.g. /aɪ/, /ɪə̯/ and /aʊ/) are handled as one phoneme. However, this mapping can be easily adapted according to the degree of grapheme-phoneme correspondence of new target languages.
Fig. 2. Mapping the acoustic features of a phoneme sequence to the visual
features of the corresponding grapheme sequence.
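The transfer step can be illustrated with a hand-written alignment; the paper derives the alignment automatically with the m2m-aligner, and the parameter values below are invented:

```python
# Sketch of carrying a phoneme's acoustic parameters over to its aligned
# grapheme group. The alignment is written by hand here; the paper derives
# it automatically with the m2m-aligner. All parameter values are invented.
# English "high": /h/ -> "h", /aɪ/ -> "igh" (one phoneme, three graphemes).

phoneme_to_graphemes = [("h", "h"), ("aɪ", "igh")]
phoneme_params = {            # (loudness, pitch, speed), already normalized
    "h": (0.4, 0.5, 0.5),
    "aɪ": (0.8, 0.7, 0.3),
}

char_params = {}
pos = 0
for phoneme, graphemes in phoneme_to_graphemes:
    for g in graphemes:
        # every grapheme of the group inherits the phoneme's parameters
        char_params[pos] = (g, phoneme_params[phoneme])
        pos += 1
```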
B. Visualization
Today’s typefaces typically comprise only a limited set of around nine fonts, combining different nuances of stroke weight (light, regular, black) and character width (extended, regular,
condensed). However, if we want to map continuous
characteristics of the voice into a visual representation, we do
not only need mapping functions for the different acoustic
parameters, but a continuous visual representation. Therefore, an
apparatus to change the character on a continuous scale is
required. In addition, we have the requirement that the proposed
approach has to work for longer text passages which constrains
the degree of freedom in our visual expression. For now we have
decided to work with the parameters vertical stroke weight,
horizontal stroke weight and character width. The three free
parameters are demonstrated in Figure 3.
Fig. 3. Demonstrating the three freely adjustable parameters vertical stroke
weight, horizontal stroke weight and character width.
We decided not to consider other possible parameters, such as height, contrast, sharpness, and skewness, for the reasons just given. Other freely adjustable parameters can, of course, be considered in future work.
Expressing a bandwidth of emotional states through written text requires that the origin point of a dynamic character shape be a simple, generic, rational and reduced form. Therefore, we adopted one of the most satisfactory modernist sans serif typefaces of the twentieth century, “Futura” by Paul Renner (1927) [23, p. 80].
1) Continuous scale of character shape
Since, to the best of our knowledge, no type family or
software exists which is able to fulfill our particular needs, we
designed our own type family and developed a font processing
tool. This enables changing every parameter of each character in real time without losing distinctive and aesthetic character shapes. To guarantee a functional, stringent and aesthetic
character shape with our automatic mapping function, we
manually defined the extrema of our continuous space as
demonstrated in Figure 4 with the three type design
dimensions vertical stroke weight, horizontal stroke weight
and character width.
Fig. 4. Continuous interpolation between each parameter. Each parameter ranges from 0 to 1. The values of the parameters are represented as a triplet (vertical stroke weight, horizontal stroke weight, character width). The average font (0.5, 0.5, 0.5) is located in the center of the space.
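The continuous space between the manually defined extrema can be sketched as a linear interpolation per parameter; the metric names and font-unit values below are invented for illustration and do not come from the paper's Futura-based type family:

```python
# Sketch of the continuous character space: each of the three parameters
# runs from 0 to 1 and linearly interpolates a glyph metric between two
# manually designed extremes. Metric names and values are invented.

EXTREMES = {  # (value at parameter 0.0, value at parameter 1.0), font units
    "vertical_stroke_weight":   (20.0, 180.0),
    "horizontal_stroke_weight": (20.0, 180.0),
    "character_width":          (400.0, 900.0),
}

def glyph_metrics(v, h, w):
    """Interpolate metrics for a parameter triplet (v, h, w) in [0, 1]^3."""
    params = {"vertical_stroke_weight": v,
              "horizontal_stroke_weight": h,
              "character_width": w}
    return {name: lo + t * (hi - lo)
            for name, t in params.items()
            for lo, hi in [EXTREMES[name]]}

# The average font (0.5, 0.5, 0.5) lies in the center of the space.
center = glyph_metrics(0.5, 0.5, 0.5)
```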
2) Mapping voice characteristics to character shape
Every single piece of information sent and perceived by
humans is embedded in a particular manner of formatting which
goes far beyond the transfer of pure information. Transferring values present in acoustic signals into a visual representation only makes sense if the result can be interpreted to ‘extract’ the original meaning. Thus, a comprehensible and reasonable relationship
between speech characteristics and the character shape has to be
found. This requires that common principles in formatting exist
in speech and typography and that these principles can be well
mapped. After these considerations, we decided to map the voice
to the character shape as follows:
• Loudness: Producing loudness in speech amplifies the signal and is usually used to get the attention of a listener. To get the attention of the reader, bolder text produced with more stroke weight is commonly used, since it makes it easier and more efficient to scan the text and recognize important keywords [24]. Increasing the stroke weight commonly affects the vertical and horizontal stroke weight equally. To keep each acoustic feature separately recognizable after mapping it to its visual representation, we decided to increase only the vertical stroke weight. Contrary to the adjustment of the horizontal stroke weight, increasing the vertical stroke weight is more common and attracts the attention of the reader, which can be explained by the historical development of the classified styles of type [25].
• Pitch: Numerous studies confirm that the emotional expression of utterances is formed by variations of pitch levels [26, 27, 28]. High pitch levels draw the attention of a listener and express additional emotions, such as joy, anxiety or fear, while medium pitch levels account for more neutral attitudes [29]. While we adjust the vertical stroke weight according to the loudness, we adapt the horizontal stroke weight depending on the pitch level, since this modification increases the reader’s curiosity. Due to this aspect, inverse-contrast (Italian) fonts [30] with significant horizontal stroke weight have a high recognition factor. Nicolete Gray wrote about these Italian typefaces: “a crude expression of the idea of perversity” [31], while others call them “degenerate” [32]. Consequently, we learn that adjusting the horizontal stroke weight touches the reader’s emotions and fits to express pitch.
• Speed: The processes of information transfer with speech and reading happen within a time period. A reader usually jumps from one part of a word to the next [33]. Increasing the character width extends this scanning process of the eyes. Therefore, we map the speed of the utterance to the character width.
Fig. 5. Mapping speech characteristics on text formatting.
Figure 5 summarizes the mapping of the acoustic characteristics loudness, pitch and speed to their visual representations vertical stroke weight, horizontal stroke weight and character width.
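This fixed mapping can be written down directly as a lookup; the dictionary below merely restates Figure 5 with illustrative names:

```python
# The fixed feature-to-shape mapping of Figure 5 as a lookup, applied to a
# normalized feature triplet. The names are illustrative.

MAPPING = {
    "loudness": "vertical_stroke_weight",
    "pitch": "horizontal_stroke_weight",
    "speed": "character_width",
}

def shape_parameters(features):
    """features: dict with normalized loudness, pitch, speed in [0, 1]."""
    return {MAPPING[name]: value for name, value in features.items()}

shape = shape_parameters({"loudness": 0.9, "pitch": 0.5, "speed": 0.2})
```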
IV. APPLICATIONS

This section presents some possible applications where VDTD can be used to support given text by providing additional information.
A. Language learning and speech-language pathology
Many people learning a new language have trouble getting the intonation and accentuation right. A potential of VDTD is that it allows illustrating intonation and accentuation in software and books for language learners.
Fig. 6. Language learning with VDTD.
Figure 6 illustrates the VDTD representation of the Spanish word “attencíon”, meaning “attention”. While the “accented í” indicates an intonation of the “i” in the static text, one has no clue about the intonation at the beginning of the word. However, with the help of VDTD, the stress of the “a” becomes visible.
In addition to language learning, VDTD indicates in other fields how to pronounce words, e.g. in speech pathology. It enables catching additional information which may not be accessible with exclusively acoustic information. For instance, a change of loudness within a spoken word might not be recognized by the listener, but seeing its VDTD transcription might make the differences visible.
B. Hints for deaf people
Deaf people profit from VDTD since it gives them hints on
how something was spoken. This is helpful in different
situations: (1) learning how to pronounce (similar to language
learning and speech-language pathology) or (2) interpreting how
something is intended.
There have been several efforts, even by big organizations such as the BBC and Amnesty International, to make life after hearing loss more comfortable (e.g. The Future of Subtitling). Our approach has the potential to enrich the television and cinema experience for deaf people.
C. Hints for dyslexia
A reading disorder is primarily influenced by the so-called phonological awareness [3]. It refers to an individual’s awareness of the phonological structure of words [35]. Phonological awareness involves the detection and manipulation of sounds at three levels of sound structure: (1) syllables, (2) onsets and rimes, and (3) phonemes [36], and also has an impact on the process of reading and writing [37]. People with an unsatisfactory phonological awareness are not able to extract the correct orthographic word out of a spoken utterance. VDTD offers a new way to visualize a comprehensible relationship between the spoken utterance and the written text.
D. Subtitles
Subtitles on television screens are very popular in places with either a lot of background noise (in the gym, at stations, at the airport) or where different programs are broadcast simultaneously (in sports bars where several football, baseball or basketball games are shown). Figure 7 shows a scene during a touchdown, a very emotional experience, at least for someone who is excited about this sport. While a static subtitle does not transfer the TV host’s emotion, the subtitle modified with our approach reflects the host’s admiration for carrying the ball over 80 yards and his excitement.
Fig. 7. Scene of a football game with VDTD subtitles.
E. Texting
People love texting, be it on the smartphone, on Twitter, during gaming, etc. Additionally, using automatic speech recognition to generate text messages and sending voice messages have become more and more popular. Emoticons help us to add meta-information transferring irony and emotions. If it is not possible to listen to voice messages, there are already services transcribing them and sending them back in text format. However, in the static text of the transcription, meta-information is lost and no emoticons help the receiver. VDTD can support such a scenario.
Fig. 8. Chatting with the support of VDTD text.
Figure 8 demonstrates a snippet of a chat. Given only static text, it is not clear whether Rajat is serious with his question about their World Cup plans. Using VDTD could help here to convey the irony in the question as meta-information.
F. Singing and karaoke
Voice-driven type designed songbooks can support singing songs without requiring the ability to read notes, since loudness, pitch and speed are illustrated in a more comprehensible form. They also provide additional information to those who are able to read and interpret notes.
Fig. 9. Singing with VDTD (Song: Queen We will rock you).
Figures 9 and 10 demonstrate the refrain of Queen’s song “We will rock you”. Using VDTD, the pitch decrease over the first four words is visible. Additionally, we have information about the loudness, which is mostly not represented in notes, as in this example. With the proposed approach it can be observed that the word “rock” has to be sung much louder. VDTD also provides information at a more detailed resolution: the notes provide only a single tone while a whole word is sung. Given only the text, it is unclear how long each phoneme has to be pronounced. In the given example we see that the ‘e’ in both ‘we’s and the ‘ou’ in ‘you’ have to be pronounced longer than the other phonemes.
Fig. 10. Karaoke TV screen using VDTD (Song: Queen We will rock you).
VDTD is also helpful to indicate how to sing in karaoke. This is demonstrated in Figure 10, where the change between red and green text indicates the current position of the singer.
V. EVALUATION

To evaluate our approach, we performed a user study to determine:

• whether people agree that VDTD can be used for several applications,

• whether loudness, pitch and speed of the original speech signal are represented well by the mapping function,

• whether the proposed visual representation can represent the emotions expressed, and

• whether speaker-specific characteristics can be observed.

(Footnote: “Erlkönig”, written in 1782, is a very emotional poem with a tragic ending, consisting of 8 verses of 4 lines each. A recording of the 32 lines is approximately 2 minutes long on average.)
A. Experimental Setup
For our experiments, we asked four persons to read Johann Wolfgang von Goethe’s German poem “Erlkönig” (Erl-King) [38]. We used close-talking microphones to provide recordings without distortions which could reduce the reliability of the extracted acoustic features [39]. In order to investigate similarity patterns for individual speakers, we recorded each speaker twice. Therefore, our dataset consists of 8 audio files from 4 speakers.
The participants of our user study were randomly selected
volunteers who participated free of charge. In total, 9
participants (6 female, 3 male) were exposed to VDTD and
asked to fill out a questionnaire. 5 subjects were in the age range
between 20 and 30 years, 1 subject between 30 and 40, 1
between 40 and 50 and 2 above 50. 6 subjects were design
professionals or students, and 3 had no particular design
background. Our goal was to include both participants with a stronger sense of typography and participants less versed in this topic. The participants’ mother tongue is German. They understand the tragic poem with its emotional aspects well; most of them know it from school. Except for the survey described in Section E, the scores followed a forced-choice Likert scale ranging from (1) strongly agree, (2) agree, (3) neutral, (4) disagree to (5) strongly disagree.
B. Acceptance
The question whether VDTD is interesting for several applications was answered positively by most participants, with an average score of 1.5, a result between strongly agree and agree. They see its applications in particular in poems, in learning a second language, in providing hints for acting, and in artistic expression.
C. Representation of Voice Characteristics
In this experiment, subjects listened to one recording of the poem and graded how well each verse of five texts represents the loudness, pitch and speed of the recording. Table 1 gives the results for VDTD (our algorithm), R1 and R2 (randomly chosen parameters), and HTD (homogeneous type design, i.e. static text).

We see that our approach results in the best scores and that loudness and pitch are better represented than speed. Random variations also result in positive evaluations, in contrast to HTD. This can be explained by the clustering illusion [40] and apophenia [41]: humans seek sense in random patterns.
In this context it is also interesting to note that those
volunteers who have a strong linguistic background compared
and analyzed the different visualizations more carefully. Some
discovered that, in the case of random changes, the visualization is not consistent with the way people pronounce the given words.
(Table 1. Representation scores per acoustic feature for VDTD, R1, R2 and HTD.)
D. Representation of Emotions
In this experiment, the subjects were asked how well each text reflects the emotions expressed. The subjects were given three texts, as shown in Figure 11: (1) homogeneous type design, with an average score of 4.44; (2) the random approach, with an average score of 2.67; and (3) VDTD, with an average score of 2.33. It is interesting to note that people slightly agree that emotions are present even though the parameters are randomly chosen. Our approach, however, received the strongest agreement.
Fig. 11. Example of text generated by homogeneous, random and voice driven type design.
E. Representation of Speakers’ Characteristics
To determine whether the visual representation preserves individual characteristics of a speaker, we calculated the standard deviation of the visual variation, represented by the acoustic features, between utterance pairs of the same speaker as well as between pairs of two different speakers. Comparing the standard deviations in Table 2 of the three normalized features (ranging between 0 and 1) loudness, pitch and speed, we see that all of them are significantly smaller for the same speaker than for different speakers.
(Table 2. Standard deviation of the normalized acoustic features for same-speaker and different-speaker utterance pairs.)
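This comparison can be sketched as follows; the feature sequences are invented toy data, and the exact deviation measure used in the paper may differ in detail:

```python
# Sketch of the speaker-consistency measure: per feature, the standard
# deviation of the difference between two utterances' normalized feature
# sequences, compared for same-speaker vs. different-speaker pairs.
# The sequences are invented toy data.

def stdev(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def pair_deviation(seq_a, seq_b):
    """Std of the per-character feature differences of two utterances."""
    return stdev([a - b for a, b in zip(seq_a, seq_b)])

speaker1_take1 = [0.4, 0.6, 0.5, 0.7]
speaker1_take2 = [0.42, 0.58, 0.52, 0.69]   # close to take 1
speaker2_take1 = [0.7, 0.3, 0.8, 0.4]       # different speaking style

same = pair_deviation(speaker1_take1, speaker1_take2)
diff = pair_deviation(speaker1_take1, speaker2_take1)
# with consistent speakers, `same` comes out smaller than `diff`
```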
But are these parameters represented well in the shape of the characters? We asked our participants to find four utterance pairs given 8 utterances from 4 speakers; see Figure 12. The results in Table 3 show that four participants found all four pairs, which by pure chance has a probability of only 1/105 (there are 7 · 5 · 3 · 1 = 105 ways to partition 8 utterances into 4 pairs). We can therefore conclude that the visual representation gives some clue about speaker-specific properties.
Fig. 12. Individual speech characteristics are also visible in spoken text due to
VDTD. Utterance pairs of the same speakers (1,2), (3,4), (5,6), and (7,8).
(Table 3. Number of utterance pairs found x times by the participants.)
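The chance level of the pairing task can be verified with a short combinatorial check:

```python
# Sanity check of the chance level for the pairing task: the number of ways
# to partition 8 utterances into 4 unordered pairs is 7 * 5 * 3 * 1 = 105,
# so guessing all four speaker pairs correctly has probability 1/105.

from math import factorial

def pairings(n):
    """Number of perfect matchings of n items (n even)."""
    return factorial(n) // (factorial(n // 2) * 2 ** (n // 2))

chance = 1 / pairings(8)
```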
VI. CONCLUSION AND FUTURE WORK

We have proposed VDTD as a novel concept to represent additional information present in verbal communication also in text-based communication. To keep and convey this information, we change the shape of each single character according to particular acoustic features.
Potential applications of our proposed approach to visualize
the characteristics present in the voice in the type of the
characters are manifold: it has the potential to support learning to read and to speak a foreign language. It can provide hints for
actors on the intended prosody. Furthermore, it offers novel
possibilities in subtitles on television screens which might be
particularly beneficial for hearing impaired and deaf people.
VDTD can be helpful in singing and karaoke and adds meta-
information to transcribed voice messages.
In addition to the experiments described in this paper, we have gained positive experiences and feedback with our interactive VDTD demo system, which we presented at the exhibition GLOBALE: Infosphere hosted by ZKM Karlsruhe (Center for Art and Media) [42] and at the conference Mensch & Computer (MuC 2015) [43] in Germany. In this system we let the audience read poems aloud and then automatically present them the voice driven type designed text on a screen.
Having presented and discussed potential VDTD application
scenarios and completed initial analyses of acceptance and
information gain, we plan further analyses and usability studies
to find optimal features and conditions for the applications. For
example, we have demonstrated that the modification of vertical
and horizontal stroke weight plus character width works to give
the reader hints about the loudness, pitch, and speed of spoken
utterances. How strongly the shape of the characters has to be changed has not been investigated so far and may even depend on the application. For instance, for characters in the subtitles of serious news anchors, a smaller interpolation space may be chosen than for the subtitles of football, baseball, or basketball games, where more emotion is to be represented. Besides the context, the strength of the changes may also depend on the font size. Another interesting question is whether the change in character shapes has to be evident to the reader, or whether it can already support reading when the changes are so small that they are not consciously recognized.
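The adjustable interpolation space discussed above can be sketched in a few lines. The following Python fragment is only an illustration: the assignment of loudness, pitch, and speed to vertical stroke weight, horizontal stroke weight, and character width follows the mapping described in this paper, but the numeric ranges and the strength parameter are invented assumptions, not the values used in our system.

```python
def vdtd_shape(loudness, pitch, speed, strength=1.0,
               ranges=((0.5, 2.0),   # vertical stroke weight (loudness)
                       (0.5, 2.0),   # horizontal stroke weight (pitch)
                       (0.7, 1.4))): # character width (speed)
    """Map normalized acoustic features (each in 0..1) to relative
    glyph-shape multipliers. `strength` scales the interpolation
    space: 0.0 leaves the glyph unchanged (all multipliers 1.0),
    1.0 uses the full (assumed) range. All range values are
    illustrative assumptions."""
    multipliers = []
    for t, (lo, hi) in zip((loudness, pitch, speed), ranges):
        full = lo + t * (hi - lo)             # value in the full range
        multipliers.append(1.0 + strength * (full - 1.0))
    return tuple(multipliers)

# A quiet, low, slow utterance with the interpolation space switched off:
print(vdtd_shape(0.1, 0.2, 0.3, strength=0.0))  # -> (1.0, 1.0, 1.0)
```

A context-dependent deployment (news subtitles vs. sports subtitles) would then only vary the single `strength` value rather than the whole mapping.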
Future work may include a comparison and analysis of the acceptance of other typographic modifications depending on the application scenario, e.g. sharpness to represent pitch variations. We have transferred the voice characteristics at the character level, since our intention was to represent differing spoken characteristics at different positions within a word. For certain applications or languages, it may be better to move to the syllable or word level.
We have evaluated our experiments with German, our mother tongue, a Germanic language with a fairly regular grapheme-to-phoneme relationship [20]. Future work will include additional evaluations of languages with a more ambiguous relationship, e.g. English, where one letter can represent a variety of sounds conditioned by complex rules and many exceptions [44]. Further work could also include the application to other writing systems such as Arabic, Hebrew, Greek, Cyrillic, Hangul, Hiragana, and Katakana. How to adapt VDTD to logograms and pictograms as used in Chinese characters could be another interesting question to be investigated.
A further challenge is to tackle numbers. To generate the corresponding phoneme sequence, a sequence of digits first needs to be converted to characters with the help of a number speller. Then we can adapt the characters with our approach. However, how should the voice characteristics be transferred back to the single digits? This is a challenge, since far more phonemes correspond to a single digit than to a single character in ordinary words (e.g. 24 = twenty-four). Moreover, in contrast to words, numbers are not pronounced from left to right consistently in all languages. For example, in German the number “24” is pronounced as “vierundzwanzig” (“four-and-twenty”), i.e. the ones digit is spoken before the tens digit.
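To make the digit problem concrete, the following hypothetical Python sketch spells out the German number 24, assumes an invented per-character loudness track for the spoken form, and averages those values back onto the two digit glyphs. The span boundaries and all feature values are illustrative assumptions, not output of our system.

```python
def digit_features(char_features, digit_spans):
    """Average per-character feature values of the spelled-out number
    back onto the individual digits. `digit_spans` maps each digit to
    the (start, end) slice of the spelled-out form it corresponds to."""
    return {digit: sum(char_features[a:b]) / (b - a)
            for digit, (a, b) in digit_spans.items()}

spelled = "vierundzwanzig"             # German for 24: ones digit first
digit_spans = {"4": (0, 4),            # "vier"    -> digit 4
               "2": (7, 14)}           # "zwanzig" -> digit 2
# invented loudness values, one per character of the spoken form
loudness = [0.2, 0.2, 0.2, 0.2,        # "vier" spoken quietly
            0.5, 0.5, 0.5,             # "und"
            0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]  # "zwanzig" spoken loudly
assert len(loudness) == len(spelled)

print(digit_features(loudness, digit_spans))
```

Note how the averaging also resolves the ordering problem: the feature for the digit “4” comes from the beginning of the spoken form even though “4” is the second written digit.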
Further topics are the adaptation of punctuation marks and whitespace between word tokens. Is it beneficial to modify the shape of a comma, question mark, exclamation mark, etc. according to the characteristics of the preceding character, or not at all? For aesthetic reasons it can make sense to transfer the characteristics of the preceding character. So far we have not adapted the whitespace to the duration of the pause between two words. However, an adaptation according to the timing information is promising, for example to display pauses in karaoke applications.
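Such a pause-driven whitespace could be sketched as follows; the linear scaling factor and the upper cap are illustrative assumptions, not measured values.

```python
def space_width(pause_seconds, base_em=1.0, em_per_second=2.0, max_em=4.0):
    """Width of the whitespace between two words, in em units.
    A pause of 0 s keeps the normal word space; longer pauses widen
    it linearly, capped so extreme pauses stay readable. All constants
    are illustrative assumptions."""
    return min(base_em + em_per_second * pause_seconds, max_em)

print(space_width(0.0))   # no pause: normal word space
print(space_width(0.25))  # short hesitation: slightly wider space
print(space_width(3.0))   # long dramatic pause: capped width
```

In a karaoke display, the capped width keeps long instrumental pauses from pushing the following lyrics off screen.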
So far we have applied our method to sans-serif letters. Another possible application would be to use it to imitate handwriting or calligraphy. Using VDTD for the automatic production of handwriting and calligraphy may lead to a more realistic typeface variation than random or rule-based variations.
[1] J. Berger, “Handwriting Analysis and Graphology,” in M. Shermer (ed.), The Skeptic Encyclopedia of Pseudoscience, ABC-CLIO, pp. 116-120.
[2] J. Needham and T. Tsuen-Hsuin, Science and Civilisation in China:
Volume 5, Chemistry and Chemical Technology, Part 1, Paper and
Printing, ser. Science and Civilisation in China. Cambridge University
Press, 1985.
[3] R. Kinross, Modern Typography, vol. 2, London: Hyphen Press, 2010.
[4] E. C. Berkeley, “Secretaries Get a Computer of their Own to Automate Typing,” Computers and Automation, vol. 18, no. 1, p. 59, 1969.
[5] J. B. Walther and K. P. D’Addario, “The impacts of emoticons on
message interpretation in computer-mediated communication,” Social
science computer review, vol. 19, no. 3, pp. 324-347, 2001.
[6] J. T. Hancock, C. Landrigan, and C. Silver, “Expressing emotion in text-
based communication,” in Proceedings of the SIGCHI conference on
human factors in computing systems. ACM, pp. 929-932, 2007.
[7] P. Shaw, “Codex: The Journal of Letterforms,” in The Menhart Issue.
John Boardley, 2012.
[8] J. Lee, S. Jun, J. Forlizzi, and S. E. Hudson, “Using Kinetic Typography
To Convey Emotion In Text-Based Interpersonal Communication,” in
the 6th Conference on Designing Interactive Systems. ACM, pp. 41-49.
[9] R. Rashid, Q. Vy, R. Hunt, and D. I. Fels, “Dancing with Words: Using
Animated Text for Captioning,” Intl. Journal of Human Computer
Interaction, vol. 24, no. 5, pp. 505-519, 2008.
[10] M. Wölfel and A. Stitz, “Responsive Type—Exploring Possibilities of
Self-Adjusting Graphic Characters,” in Proceedings of Cyberworlds
2015, 2015.
[11] M. Perloff and C. Dworkin, The Sound of Poetry / The Poetry of Sound, University of Chicago Press, 2009.
[12] R. Huelsenbeck, Dada Almanach, Berlin: Erich Reiss Verlag, p. 53.
[13] E. Adamowicz and E. Robertson, Dada and Beyond, vol. 1: Dada Discourses, Editions Rodopi B.V., Amsterdam/New York, p. 42, 2011.
[14] L. L. Rezabek and J. J. Cochenour, “Visual Cues in Computer-Mediated Communication: Supplementing Text with Emoticons,” Journal of Visual Literacy, vol. 18, no. 2, 1998.
[15] B. Mesquita, and R. Walker, Cultural differences in emotions: A context
for interpreting emotional experiences, Behaviour Research and Therapy,
41(7), pp. 777-793, 2003.
[16] S. Matsumiya, S. Sakti, G. Neubig, T. Toda, and S. Nakamura, “Data-
Driven Generation of Text Balloons based on Linguistic and Acoustic
Features of a Comics-Anime Corpus,” in Fifteenth Annual Conference of
the International Speech Communication Association, 2014.
[17] R. Zimmermann, “Use Of Multimedia Data For Emoticons In Instant
Messaging,” Jul. 28 2005, US Patent App. 10/767,132. [Online].
[18] T. Kisler, F. Schiel, and H. Sloetjes, “Signal Processing via Web Services: The Use Case WebMAUS,” in Digital Humanities 2012, Hamburg, Germany, pp. 30-34, 2012.
[19] S. Jiampojamarn, G. Kondrak, and T. Sherif, “Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion,” in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, New York: Association for Computational Linguistics, pp. 372-379, April 2007.
[20] T. Schlippe, S. Ochs, and T. Schultz, “Grapheme-to-Phoneme Model
Generation for Indo-European Languages,” in the 37th International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012),
Kyoto, Japan, 25-30 March 2012.
[21] M. Goudi and P. Nocera, “Sounds and Symbols: An Overview of Different Types of Methods Dealing with Letter-to-Sound Relationships in a Wide Range of Languages in Automatic Speech Recognition,” in Proceedings of SLTU 2014, St. Petersburg, Russia, 14-16 May 2014.
[22] A. W. Black, K. Lenzo and V. Pagel, “Issues in Building General Letter
to Sound Rules,” ESCA Workshop on Speech Synthesis, 1998.
[23] R. Poulin, Graphic Design and Architecture, A 20th Century History: A
Guide to Type, Image, Symbol, and Visual Storytelling in the Modern
World. Rockport Publishers, pp. 80, 2012.
[24] R. Bringhurst, The Elements of Typographic Style, vol. 3.2, Hartley and
Marks Publishers, pp. 55-56, 2008.
[25] D. B. Updike, Printing Types, Their History, Forms and Use: A Study in Survivals, vol. 2, Geoffrey Cumberlege, Oxford University Press, London.
[26] M. Pell, M. Paulmann, S. Dara, A. Alasseri, and S. Kotz, “Factors in the recognition of vocally expressed emotions: a comparison of four languages,” J. Phon., vol. 37, pp. 417-435, 2009.
[27] J.-A. Bachorowski and M. J. Owren, “Vocal expression of emotion: acoustic properties of speech are associated with emotional intensity and context,” Psychol. Sci., vol. 6, pp. 219-224, 1995.
[28] C. Williams and K. Stevens, Emotions and speech: some acoustical
correlates, J Acoust Soc Am, vol. 52, pp. 1238-1250, 1972.
[29] E. Rodero, “Intonation and emotion: Influence of pitch levels and contour type on creating emotions,” Journal of Voice, vol. 25, no. 1, pp. 25-34.
[30] Caslon & Catherwood’s, Type specimen, 1821.
[31] N. Gray, Nineteenth Century Ornamented Typefaces, Faber & Faber Ltd,
London, 1938.
[32] J. H. Benson and A. G. Carey, The Elements of Lettering, Newport, Rhode Island: John Stevens, 1940.
[33] G. Unger, Wie man’s liest, Sulgen/Zürich: Niggli Verlag AG, pp. 63-65.
[34] S. Phillips, K. Kelly, L. Symes, Assessment of Learners with Dyslexic-
Type Difficulties, SAGE, pp. 7, 2013.
[35] W. E. Tunmer & C.M. Fletcher, The Relationship between Conceptual
Tempo, Phonological Awareness, and Word Recognition in Beginning
Readers, Journal of Literacy Research, vol. 13, no. 2, pp. 173-185, 1981.
[36] J. B. Gleason, e-Study Guide for: The Development of Language, Content
Technologies, 2012.
[37] C. Schnitzler, Phonologische Bewusstheit und Schriftspracherwerb,
Georg Thieme Verlag, Stuttgart, pp. 1, 2008.
[38] “Der Erlkönig,” 2014, Erlkönig.
[39] M. Wölfel and J. McDonough, Distant Speech Recognition, John Wiley
& Sons, 2009.
[40] R. D. Clarke, “An application of the poisson distribution,” vol. 72, Journal
of the Institute of Actuaries, p. 481, 1946.
[41] K. Conrad, Die beginnende Schizophrenie. Versuch einer Gestaltanalyse
des Wahns, Stuttgart: Georg Thieme Verlag, 1958.
[42] M. Wölfel, A. Stitz, and T. Schlippe, “Voice Driven Type Design,” GLOBALE: Infosphere, ZKM (Center for Art and Media), Karlsruhe, Germany, 2015-2016.
[43] M. Wölfel, A. Stitz, and T. Schlippe, “A Voice Driven Type Design Demo,” in Proceedings of Mensch und Computer 2015, pp. 413-416, Stuttgart, Germany, 2015.
[44] A. Waibel, H. Soltau, T. Schultz, T. Schaaf, and F. Metze, “Multilingual
Speech Recognition,” Verbmobil: Foundations of Speech-to-Speech
Translation, ed. Wolfgang Wahlster, Springer Verlag, 2000.
