Text-to-speech audio description of voiced-over films. A case study of audio described Volver in Polish

text-to-speech audio description of voiced-over films
Given that the production (especially the recording) of AD is quite costly, few
audio described films are available on the Polish market. Moreover, there is practically
no audio description for foreign films in Poland, since it has been assumed that blind
and partially sighted audiences will not manage to assimilate multiple soundtracks
(the original soundtrack in a foreign language, the voice-over and the audio description).
In order to overcome the cost hurdle, we propose text-to-speech audio description
(TTS AD) as a cheaper alternative to traditionally produced AD. We demonstrate how
TTS AD can be combined with voice-over to produce AD for foreign films, using the
example of Volver by Pedro Almodóvar. We also present the results of a survey conducted
among a group of blind and partially sighted viewers after a screening of voiced-over
Volver with TTS AD. The results demonstrate that the participants are quite open to
the idea of TTS AD both as an interim solution, until more audio described films
become available, and as a permanent solution.
Agnieszka Szarkowska
University of Warsaw, Poland
Anna Jankowska
Jagiellonian University, Krakow, Poland
Back in the 1990s, the map of audiovisual Europe was divided into dubbing,
subtitling and voice-over countries (Gottlieb 1998). With subtitles in the
cinemas and television voice-over for both fiction and non-fiction audiovisual
productions, Poland seems to be an exception in Europe’s audiovisual landscape.
Television voice-over is at the same time an advantage and a disadvantage for
the blind and partially sighted audience. On the one hand, unlike subtitles, voice-
over provides them with the translation of the foreign dialogue. Unfortunately, at
the same time it has been assumed that due to multiple soundtracks (the original
soundtrack in a foreign language, the voice-over and the audio description), it
is pointless – if not impossible – to combine voiced-over foreign films with
audio description (AD). Therefore, there is practically no audio description
for foreign films in Poland. This stands in stark contrast with the number of
foreign language films screened in cinemas, broadcast on TV or released on DVD/
Blu-ray. Last but not least, as declared on many occasions, blind and partially
sighted people, just like a sighted audience, want to watch foreign films.
In order to audio describe a foreign film, one needs to combine the AD script
with either audio subtitling or voice-over. Since an overwhelming majority
of films in Poland are voiced-over, in our paper we will present how this AVT
modality can be combined with audio description. Given that the production
(esp. recording) of AD is quite costly, there are not very many audio described
films available on the Polish market. In order to overcome the cost hurdle, we
propose text-to-speech audio description (TTS AD) as a cheaper alternative to
traditionally produced AD. We will demonstrate how TTS AD can be combined
with voice-over to produce AD for foreign films, using the example of Volver by Pedro
Almodóvar. Finally, we will present the results of a survey conducted among
a group of blind and partially sighted viewers after a screening of voiced-over
Volver with TTS AD.
Polish audiovisual landscape
Until recently, Europe has traditionally been divided into dubbing, subtitling and
voice-over countries (Gottlieb 1998). However, the recent findings of the Media
Consulting Group (2007) show that this classical division is a simplification of a
far more complex situation. We no longer can, if we ever could, talk about strictly
dubbing, subtitling or voice-over countries since most of them employ all three
methods in various contexts.
With dubbing, subtitling and voice-over existing side by side, Poland seems
to be an excellent example of this complexity. The most popular AVT modality in
Polish cinemas is subtitling. For young audiences, however, dubbing is preferred.
Nevertheless, movies considered to be suitable for audiences of any age (e.g.
Alice in Wonderland, Harry Potter, Shrek) are often made available both dubbed
and subtitled. Voice-over is without any doubt the predominant modality in
television broadcasts. It is used for both fiction and non-fiction audiovisual
products, the only exception being animated and non-animated productions for
children and some animated movies for the general audience (e.g. Shrek). The
choice, however, is not always obvious – the BBC television series The Chronicles of
Narnia was broadcast with voice-over, whereas the motion picture The Chronicles
of Narnia by Walt Disney Pictures was dubbed. DVDs and Blu-ray usually contain
both voice-over and subtitles, or dubbing and subtitles.
Such an audiovisual landscape influences many areas – one of them is
accessibility for the blind and partially sighted. While cinema with subtitles is
virtually inaccessible to audiences with visual impairments, one could argue
that television voice-over has the advantage of at least providing them with the
translation of the foreign dialogue. In the long run, however, this seems to be a
drawback rather than an advantage for the blind and partially sighted. Currently
there is no audio description on Polish television, mostly because the digital turn
is yet to come.2 Unfortunately, the chances of making television fully accessible
for the visually impaired are slim, since it has been assumed that it is impossible
to provide audio description for voiced-over programmes. In a country with a high
import of foreign television programmes, where most, if not all, of them are
voiced-over, this virtually means that television will not be accessible, or will be
accessible only in a very limited way, via audio described programmes produced in Polish.
Audio description in Poland – development and challenges
Though to many it may come as a surprise, audio description was launched in
Poland already in the late 1990s. As described by Jankowska (2008: 242), the
project was initiated by Andrzej Woch, a blind employee of the Jagiellonian
University in Krakow, and funded by the Central Library of the Polish Association
of the Blind (Polski Związek Niewidomych, PZN). The so-called typhlo-films involved
providing additional commentary for twenty films on VHS tapes, making them
available to the blind and partially sighted in the Central Library. As opposed
to today’s audio description, freezing the image whenever the additional
commentary did not fit in the gap between the dialogues was a common practice
in typhlo-films. It is highly probable that due to this feature the idea of describing
films for the blind and partially sighted did not manage to force its way to broad
social consciousness until some years later. Nevertheless, it should be noted that
within the typhlo-film project additional commentary for the blind was provided
also for foreign voiced-over films. This, until recently, was the only attempt at
combining audio description with voiced-over films.
The first public audio described cinema screening, which took place on 27
November 2006 in Białystok, can be regarded either as a revival or as the true
beginning of audio description in Poland. The audio description for this first
screening was read live by an audio describer who was sitting with the audience
and reading the script with a microphone in his hand, so that AD could be heard
by all the spectators. Ever since, the only more or less regular screenings of audio
described films take place in Warsaw as a part of the project Cinema beyond silence
and darkness carried out by the Foundation for Children Help on Time (Fundacja
Dzieciom Zdążyć z Pomocą).1 Irregular screenings take place in other larger cities,
but they are rather one-off events very often organized by enthusiasts and
unfortunately almost as often by amateurs.
Until very recently audio described programs have been available only online.
Polish public television (TVP) audio described some of its programs (about 74
hours) and made them available online on its website. The audio described
programs include two feature films, five TV series and one series for children.
In order to access audio described films, one needs to receive a special password
which can be obtained from the Polish Association of the Blind – free of charge
for its members or for a fee for non-members. As to DVDs, at the moment of
writing this article there are only 12 discs with AD available on the Polish market.
This situation will hopefully improve, since on 1 July 2011 a new law obliging
all broadcasters to provide AD and SDH came into force. Unfortunately, at this point
it is hard to predict whether and when the broadcasters will provide accessibility
services. The law stipulates that accessibility services must cover 5% of quarterly
broadcast time (commercials and teleshopping excluded) in 2011 and, starting from
2012, about 10% of quarterly broadcast time (the 10% relates to AD, SDH and sign
language interpreting taken together).3 It also allows the broadcasters to apply to
the National Broadcasting Council for individual permission to lower the percentage
of accessibility services. For failure to provide the services, the law provides for
a fine of up to 10% of yearly income. From what can be observed now, broadcasters are
becoming interested in providing accessibility services, although for financial
reasons they are more willing to provide SDH than AD for their programmes. At the
time of writing this article, the only broadcaster to provide AD was TVP; however,
it should be mentioned that they screened one of the series that had been available
online since at least 2008.
It should be stressed that, apart from typhlo-films, all the above-mentioned
films and TV series were either originally filmed in Polish or were dubbed into
Polish. As we have already explained, it has been assumed that due to multiple
soundtracks (the original soundtrack in a foreign language, the voice-over/audio
subtitling and the audio description), audio description cannot be combined
with foreign films screened with voice-over or audio subtitles. As a result, for the
time being there are no audio described foreign language films, which is especially
striking when compared with the number of foreign language productions
present on TV or released on DVD/Blu-ray. Above all, it also conflicts with the
preferences and capabilities of the blind and partially sighted people who, as declared
on many occasions, want to watch foreign films just like the sighted audience.
This is also confirmed by our findings from a pilot study conducted prior to
launching the TTS AD project. In February 2010 we asked a group of 17 pupils,
aged 12 to 18, from the Special Educational Centre for Blind and Partially Sighted
Children in Krakow, about their television viewing preferences as well as about
their opinion on combining audio description with voiced-over/audio subtitled
foreign programs. When asked about the audio described films and series they
would like to watch, 53% preferred foreign productions, 29% opted for Polish
and 18% did not prefer either one of the options. The views on combining audio
description with voice-over or audio subtitling were especially interesting, as
87% of the interviewed pupils declared that multiple soundtracks would not be
an obstacle to film enjoyment.
Unfortunately, reluctance towards providing audio description for foreign
films is not the only challenge that needs to be faced in Poland. After the initial
enthusiasm, the lack of financing inhibits the implementation of audio description.
Previous studies on the use of synthetic speech by the visually impaired
The invention of synthetic speech has been an important milestone in the
everyday lives of many visually impaired people, who can now benefit from
a host of text-to-speech (TTS) applications both in their work and for leisure
activities. There are numerous applications harnessing the power of text-to-
speech systems for people with visual impairments: from GPS-based mobility
aids, screen reading software for web browsing, email, etc., educational tools,
such as TTS dictionaries and textbooks, to entertainment, for instance audio
subtitles in audiovisual materials (see Freitas and Kouroupetroglou 2008 in
Cryer and Home 2008: 5).
Synthetic speech enables visually impaired people to access information
without relying on other people reading it to them or waiting for it to be brailled
(Garcia 2004; Llisterri, Fernàndez, Gudayol, Poyatos and Martí 1993), thus
allowing for more independence. This is particularly important in the context of
receiving financial information (Thompson, Reeves and Masters 1999).
It is thanks to a relatively low cost of synthetic speech that the number of
materials made accessible to the blind and partially sighted people is on the
rise. A good example is the RNIB service known as Talking Books.4 RNIB research
on user attitudes towards synthetic speech in Talking Books reveals that while
most users preferred a human narrator for leisure reading, they felt a synthetic voice
“would be acceptable for reference, instructional and non-fiction books” (Cryer
and Home 2009: 5). It has also been found that attitudes towards synthetic
speech varied greatly, depending largely on previous experience with
TTS applications. Many users stressed that – as opposed to a human narrator
– synthetic speech allows them to choose their own accent-free voices with
neutral emphasis; this, in turn, makes it possible for users to “add their own
interpretation” (Cryer and Home 2009: 6).
Synthetic speech has also been used to read daily newspapers to blind and
partially sighted people, as reported in a study by Hjelmquist, Jansson and Torrell
(1990). The four-month study revealed that “whilst initially many users had
doubts about the system, and found synthesised speech difficult to understand,
all reported ‘getting used to it’ after a few hours of listening” (cited in Cryer and
Home 2008: 6). In fact, the more experience users have with synthetic speech,
the better their comprehension performance is (Rhyne 1982; Venkatagiri 1994).
Apart from previous experience with synthetic speech, other factors
that may influence comprehension of text-to-speech messages include the
presentation rate (Hjelmquist, Dahlstrand and Hedelin 1992; Koul 2003), the
voice intelligibility (Papadopoulos et al. 2009) and the presence of background
noise (Koul and Allen 1993). Last but not least, it is the quality of synthetic speech,
especially its intelligibility and naturalness, that plays a crucial role both in
comprehension and in attitudes towards text-to-speech systems. It needs to be noted
that some of the studies cited here were conducted several years ago and that the
quality of synthetic speech has improved considerably since then.
Text-to-speech audio description
Over the past few years audio description has become fashionable. Sadly, despite
the growing enthusiasm and demand for audio description, the number of
audiovisual products accessible to the visually impaired is hardly sufficient.
Text-to-speech audio description (TTS AD) was proposed as a possible option
to increase the output of audio described programs through cost reduction. As
noted by Szarkowska (2011: 145) “TTS AD offers unequalled cost-effectiveness in
terms of AD production in comparison with conventional methods of producing
audio description as it does not require the recording of the AD script (for pre-
recorded AD) nor does it incur any human labour costs for the reading out of
the AD script (for live AD)”. In fact, the human labour costs mentioned above are
reduced to the cost of creating the audio description script and adapting it to the
demands of TTS AD. Apart from the usual pitfalls of traditional AD script writing, the
TTS AD process is not excessively complicated (see Szarkowska 2011 for details).
In order to create text-to-speech audio description, first an AD script is written
and then, using subtitling software, it is synchronised with the film. Put simply,
synchronizing the AD script with a film means preparing a list of AD chunks in
the form of subtitles, each consisting of text and time codes. From the technical
point of view, the process differs from preparing the traditional subtitles only
by the fact that the text is to appear in the gaps between the dialogues and not
simultaneously with them. Later on, the text file is read by speech synthesis
software while the film is played on a multimedia player, or the AD script can be
recorded and mixed with the multimedia file.5
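The synchronisation step described above can be sketched in code. The following is only a minimal illustration and not the tooling used in the project: it assumes SRT-style time codes, an assumed reading rate of 15 characters per second for the synthetic voice, and hypothetical names (AdCue, parse_timecode, overlaps_dialogue, fits_in_gap). It checks that an AD chunk falls in a gap between dialogue lines and that its text can plausibly be read within its time slot.

```python
from dataclasses import dataclass


@dataclass
class AdCue:
    """One AD chunk: start and end in seconds plus the text to be synthesised."""
    start: float
    end: float
    text: str


def parse_timecode(tc: str) -> float:
    """Convert an SRT-style time code 'HH:MM:SS,mmm' to seconds."""
    hms, millis = tc.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def overlaps_dialogue(cue: AdCue, dialogue: list[tuple[float, float]]) -> bool:
    """Return True if the cue overlaps any (start, end) dialogue interval."""
    return any(cue.start < d_end and d_start < cue.end for d_start, d_end in dialogue)


def fits_in_gap(cue: AdCue, chars_per_second: float = 15.0) -> bool:
    """Return True if the text can be read within the cue at the assumed rate."""
    return len(cue.text) / chars_per_second <= cue.end - cue.start


# Hypothetical data: two dialogue intervals and one AD chunk placed between them.
dialogue = [(0.0, 4.0), (9.5, 14.0)]
cue = AdCue(
    start=parse_timecode("00:00:04,500"),
    end=parse_timecode("00:00:09,000"),
    text="A woman crosses the courtyard.",  # invented sample text, not from the AD script
)
print(overlaps_dialogue(cue, dialogue), fits_in_gap(cue))
```

A chunk that fails either check would need to be shortened or moved to another gap; the reading rate in particular is an assumption that would have to be calibrated against the speech synthesiser actually used.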
The TTS AD project
The scarcity of audio description concerns not only the amount but also the range
of audiovisual products. On the one hand, due to the costly production process,
there are simply not enough audio described products. On the other hand, the
reluctance towards audio describing films with voice-over/audio subtitling
results in a lack of foreign films with audio description.
The goal of the TTS AD project is to research the possibility of increasing
both the number and the range of audio described films through investigating
the feasibility of combining TTS AD with audiovisual programs broadcast with
dubbing, subtitling and voice-over as well as programs filmed originally in Polish.
The TTS AD project is being developed at two Polish universities: the
University of Warsaw and the Jagiellonian University in Krakow. The five-stage
study includes the implementation of TTS AD in:
1) a Polish feature film (see Szarkowska 2011),
2) a foreign feature film with Polish dubbing (unpublished MA thesis and
PhD dissertation),
3) a foreign feature film in English combined with voice-over in Polish
(discussed in this paper),
4) a foreign feature film in English combined with audio subtitles in Polish
(work in progress),
5) a documentary in English combined with voice-over in Polish (work in
progress).
It is our contention that the ultimate purpose of audio description is to serve the
blind and the partially sighted community, therefore reception studies are the
core of our research. In order to find out the opinion of the blind and partially
sighted viewers, we conduct a survey after every screening. One of the most
important research issues is whether visually impaired viewers would accept
TTS AD either as a permanent or interim solution. Having completed three out
of the five stages, we can say that so far TTS AD has proved to be an acceptable
solution for the visually impaired community, as 94% of our viewers accept TTS
AD as an interim solution and 63% are willing to accept it as a permanent solution.
The film selected for the purposes of this research was Volver, a 2006 Spanish
drama directed by Pedro Almodóvar. This is how Peter Bradshaw described it in
his Guardian review:
Volver (in English, Coming Home or Coming Back) is a gripping melodrama inspired by
the trash TV that is a soundtrack to its characters’ lives. Penélope Cruz is Raimunda,
a hard-working woman with a teenage daughter, Paula, and a feckless, layabout
husband. [...] Raimunda’s family life shatters with one terrible act of violence, and there
is a secret about her late mother Irene that surfaces when Irene returns from beyond
the grave to make contact with her astonished daughters.
The film lent itself quite well to audio description. It nevertheless
presented the audio describer with a number of problematic issues, such as the
vivid imagery of picture composition, insufficient time to convey important
information, culture specificity of certain elements presented on screen (food,
places or signs in Spanish) and the intensity of emotions expressed in the
characters’ faces, especially in close-ups. Given that several scenes featured at
least two female characters, whose original voice was covered by voice-over,
the issue of speaker identification was particularly pertinent. In order to solve
this problem, the names of the characters were incorporated in the AD script
whenever it was deemed necessary.
As the film was already available on DVD with Polish voice-over translation, it
was decided that this version would be used in the study, complemented with the AD
script read by text-to-speech software. For the project, the speech synthesiser Ivona
(Ivo Software) was used together with the synthetic voice Krzysztof (Loquendo).
Audio subtitling vs. voice-over
Volver was one of the few foreign (i.e. non-English-language) films audio
described in the UK and released on DVD, so the AD script had to be accompanied
by a translation of dialogues, which was done through audio subtitles read out by
a female narrator. The choice of the female voice most probably stemmed from
the nature of the film, where women play the most important characters. The
AD script, in contrast, was read by a male voice talent. This solution enabled the
audience not to confuse the AD script with the dialogues. However, the presence
of one female voice for all the (mainly female) characters and the poor quality of
the recording of audio subtitles, which drowned out the original Spanish voices so
that they were hardly audible, resulted in viewers having difficulties recognizing
which character was speaking as many scenes in the film feature a few women
talking. As a result, the overall quality of the AD was perceived as poor and the
audio described film met with fierce criticism from the British visually impaired
community (Leen Petre from RNIB – personal communication).
It is worth noting at this point that the British audience is not used to hearing
a translation of a film being read out to them on top of the original voices. Polish
viewers, in contrast, have had many years of experience of listening to the
voice-over translation of film dialogues on television, which makes them more
accustomed to this AVT modality.
Audio subtitling and voice-over seem to be two audiovisual translation
modalities which have a lot in common. First of all, they both consist of a
translation of the dialogue list of a foreign or multilingual film. Secondly, the
translation is read out to the target audience – the main difference being that in
the case of voice-over, the target audience is simply conceived of as mainstream
sighted population, whereas in audio subtitling it comprises a much smaller
group of visually impaired people. Thirdly, the translation is usually read out
by one voice talent (typically male in Poland), while the voices of the original
actors can still be heard in the background though their volume has been turned
down. Polish voice-over is done in a professional recording studio, which usually
guarantees good sound quality. Finally, apart from the different target audiences
envisaged at the production stage, audio subtitling is created together with
the AD script and thus allows for some flexibility in combining the two tracks,
whereas in the case of Poland, AD would be added to a voiced-over film at a
later stage, which makes it virtually impossible to introduce changes to the pre-
recorded voice-over so that it can be seamlessly interwoven with the AD script.
Research questions
The key objective of the present study was to determine whether visually
impaired viewers would find it acceptable for text-to-speech software to read AD
scripts to voiced-over feature films. To address this objective, the following three
research questions were formulated:
1) Which AD voice would the visually impaired prefer if they had a choice
between a human voice and a synthetic voice?
2) Would TTS AD be acceptable as an interim solution, until a system has
been agreed to have a human voice reading out the AD?
3) Would TTS AD be acceptable as a permanent solution, next to AD read by a
human voice?
The screening took place at an informal meeting for blind and partially sighted
people organised by the Foundation Chance for the Blind (Fundacja Szansa dla
Niewidomych) in Jachranka, near Warsaw, on 24 April 2010.
The audience was first invited to watch the film and after the projection they
were asked to answer 13 questions, which were read out by sighted volunteers.
The first part of the questionnaire was meant to establish participants’ age,
education, degree and type of sight loss (congenital/acquired). The second
part aimed to find out their views on the use of speech synthesis in AD, their
previous experience with audio description and their familiarity with text-to-
speech software.
After the screening, a total of 20 people were interviewed: 14 women (70%) and
6 men (30%). As shown in Table 1, five of them were blind (25%), 13 were partially
sighted (65%) and two of them (10%) were sighted.
Table 1. Participants by age and degree of sight loss
Age bracket    Blind    Partially sighted    Sighted
18-25          -        3                    -
26-39          2        7                    1
40-59          3        1                    -
60-74          -        2                    1
Total          5        13                   2
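The shares by degree of sight loss quoted before Table 1 can be recomputed from its totals row. A minimal sketch, with variable names chosen only for illustration:

```python
# Totals row of Table 1 (participants by degree of sight loss).
totals = {"blind": 5, "partially sighted": 13, "sighted": 2}

n = sum(totals.values())  # 20 participants in total

# Share of each group, rounded to whole percentage points.
shares = {group: round(100 * count / n) for group, count in totals.items()}
print(shares)  # {'blind': 25, 'partially sighted': 65, 'sighted': 10}
```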
Most participants (12 people, 67%) had a congenital sight loss, while one in three
(8 people, 33%) acquired their sight loss at a later stage in life. Both the degree and
type of sight loss were determined based on self-declarations of the participants.
12 out of 20 participants (67%) said they use text-to-speech software regularly,
either at home or at work. Only 11 people (55%) had seen some films with audio
description before, while nine of them had no prior experience of AD.
When asked about what voice they would prefer to read AD scripts, half of
the participants (10 people, 50%) declared their preference for a human voice.
Perhaps somewhat surprisingly, one person preferred a synthetic voice to read
AD, whereas many others stated that this depends on the type of programme (6
people, 30%). Three participants (15%) were not sure and would have preferred to
have more experience with AD to make a more informed choice.
In terms of accepting TTS AD as either an interim or permanent solution,
most participants were in favour of both (see Table 2). Some expressed their
concerns whether the introduction of TTS AD would not result in eliminating
human voices and substituting them completely with synthetic voices.
Table 2. The acceptance of TTS AD as an interim or permanent solution
              Interim    Permanent
Yes           95%        70%
No            -          15%
Don’t know    5%         15%
Overall, all participants apart from one were in favour of introducing TTS AD as
an interim solution, especially if it meant more audio described programmes
accessible to people with visual impairments. The participants were slightly more
sceptical, however, about the introduction of TTS AD as a permanent solution:
while 70% of them support the idea, almost one in three is either against or unsure.
A closer examination of the preferences for TTS AD as an interim or permanent
solution based on the degree of sight loss has shown a slight tendency on the part
of blind participants to be more supportive of the idea (see Table 3).
Table 3. The acceptance of TTS AD as an interim or permanent solution by degree
of sight loss
                     TTS AD as interim           TTS AD as permanent
                     Yes    No    Don’t know     Yes    No     Don’t know
Blind                100%   -     -              80%    -      20%
Partially sighted    92%    -     8%             70%    15%    15%
This may be due to the fact that partially sighted viewers can to some extent rely on
their eyesight to watch films, while blind viewers are more dependent on AD and
thus want more audio described films regardless of the voice reading the script.
In terms of gender, female participants appear to be more inclined to accept
TTS AD than male participants (Table 4).
Table 4. The acceptance of TTS AD as an interim or permanent solution by gender
           TTS AD as interim           TTS AD as permanent
Gender     Yes    No    Don’t know     Yes    No     Don’t know
Women      100%   -     -              79%    14%    7%
Men        83%    -     17%            50%    16%    34%
There seems to be no clearly discernible correlation between participants’ age and
their preference for TTS AD as an interim or permanent solution (Table 5).
Table 5. The acceptance of TTS AD as an interim or permanent solution by age
         TTS AD as interim           TTS AD as permanent
Age      Yes    No    Don’t know     Yes    No     Don’t know
18-25    100%   -     -              67%    33%    -
26-39    90%    -     10%            60%    20%    20%
40-59    100%   -     -              75%    -      25%
60-74    100%   -     -              100%   -      -
Interestingly, participants from older age groups seem to be slightly more willing
to see TTS AD as a permanent solution than those from younger age groups in the
study. Naturally, the sample is too small to draw any far-reaching conclusions.
As stated above, previous studies on synthetic speech revealed that the
experience and exposure to text-to-speech software may positively influence the
attitude towards it. This pattern seems to be confirmed in our study (Table 6).
Table 6. The acceptance of TTS AD as an interim or permanent solution by the use
of TTS software
                 TTS AD as interim           TTS AD as permanent
                 Yes    No    Don’t know     Yes    No     Don’t know
TTS users        100%   -     -              75%    8%     17%
Not TTS users    86%    -     14%            57%    29%    14%
TTS users are more likely to accept TTS AD both as an interim and as a permanent
solution when compared to those who have had no regular experience with
speech synthesis software. This pattern is more noticeable with regard to TTS AD
as a permanent solution.
A similar trend can be observed when it comes to the preference for either
human or text-to-speech narrator (Table 7).
Table 7. The preference for human/synthetic audio describers by the use of TTS
                            TTS users    Not TTS users
Human                       33%          64%
Synthetic                   -            12%
Depends on the programme    42%          12%
Don’t know                  25%          12%
Total                       12 people    8 people
In the study, people who do not habitually use text-to-speech software were
more likely to prefer human narrators, while regular TTS users were more open
to the idea that TTS AD may be a good solution for some types of programmes,
but not for all (42% stated the choice of the human/synthetic voice depended
on the programme). This issue will be pursued in further stages of our research
when we investigate the application of TTS AD in non-fiction genres, such as a
documentary and an educational programme.
Although we acknowledge that further research involving a wider range of films
and more varied viewers is needed, the present study reveals not only the
considerable potential of text-to-speech audio description but also the acceptance
of audio description for voiced-over films among visually impaired viewers. Taken
together with the results of other parts of the project, the findings indicate that
TTS AD is acceptable for the majority of the visually impaired viewers interviewed,
with a surprisingly high number of people also accepting it as a permanent solution.
This work has been supported by the research grant ‘Text-to-speech audio
description’ no. N N104 148038 of the Polish Ministry of Science and Higher
Education for the years 2010-2011 and research grant ‘Audio description with text-
to-speech software’ from the Faculty of Philology of the Jagiellonian University
in Krakow.
Many thanks to Marzena Chrobak, the Polish translator of the Volver
screenplay published by the ZNAK publishing house, for letting us use several
fragments of her translation in the AD script. We also thank Joanna Pietrulewicz
and Agnieszka Walczak for their help in administering the questionnaire after
the screening. Last but not least, we owe our warmest thanks to Marek Kalbarczyk
for inviting us to Jachranka and allowing us to carry out the study.
Notes
1 Special screenings used to take place twice a month, on selected days. So far 21
films, with both audio description and subtitles for the deaf and hard of hearing,
have been presented; however, the project was suspended for some months due to
lack of financing.
2 Things in Poland changed right after this article was written: AD has been
introduced on Polish television.
3 10% includes audio description, subtitling for the hearing impaired and sign
language interpreting taken together.
4 In 2009, the RNIB Talking Books library had over 16,000 titles. According to Cryer
and Home (2009: 8), “it takes an average of five days to record a Talking Book, and
around 100 new Talking Books are added to the library each month”. It is believed
that synthetic speech “may significantly reduce the time required to produce new
books, which may mean more books could be produced” (ibidem).
5 Initially it was impossible to change the reading speed and the volume, which were
set at the same level throughout the film. Because of that, at times the audio
description was not audible or did not convey as much information as needed. Those
problems have now been solved thanks to some advanced features of the text-to-speech
software, which allow the volume and the reading speed to be controlled and adjusted
to the on-screen action.
references
Bradshaw, P. 2006. “Volver.” The Guardian, 25 August 2006. Available at:
Cryer, H. and Home, S. 2008. “Exploring the Use of Synthetic Speech by Blind and Partially Sighted People.” RNIB Centre for Accessible Information, Birmingham: Literature review #2.
Cryer, H. and Home, S. 2009. “User Attitudes towards Synthetic Speech for Talking Books.” RNIB Centre for Accessible Information, Birmingham: Research report #7.
Freitas, D. and Kouroupetroglou, G. 2008. “Speech Technologies for Blind and Low Vision Persons.” Technology and Disability 20: 135-156.
Garcia, L.G. 2004. “Assessment of Text Reading Comprehension by Spanish-Speaking Blind Persons.” British Journal of Visual Impairment 22 (1): 4-12.
Gottlieb, H. 1998. “Subtitling.” In Routledge Encyclopedia of Translation Studies, Baker, M. (ed.), 244-248. London and New York: Routledge.
Hjelmquist, E., Jansson, B. and Torell, G. 1990. “Computer-Oriented Technology for Blind Readers.” Journal of Visual Impairment and Blindness 17: 210-215.
Jankowska, A. 2008. “Audiodeskrypcja, szczytny cel w tłumaczeniu.” Między Oryginałem a Przekładem (14): 225-246.
Media Consulting Group. 2007. “Study on Dubbing and Subtitling Needs and Practices in the European Audiovisual Industry. Final Report.” Media Consulting Group, Paris / Peacefulfish,
Papadopoulos, K. et al. 2009. “Perception of Synthetic and Natural Speech by Adults with Visual Impairments.” Journal of Visual Impairment and Blindness 103 (7): 403-414.
Rhyne, J.M. 1982. “Comprehension of Synthetic Speech by Blind Children.” Journal of Visual Impairment and Blindness 10 (10):
Szarkowska, A. 2011. “Text-to-Speech Audio Description. Towards wider availability of AD.” Journal of Specialised Translation, 142-162.
Thompson, L., Reeves, C. and Masters, K. 1999. “In the Balance: Making Financial Information Accessible.” British Journal of Visual Impairment 17 (2): 65-70.
Venkatagiri, H. S. 1994. “Effect of Sentence Length and Exposure on the Intelligibility of Synthesised Speech.” Augmentative and Alternative Communication 10: 96-104.
Appendix 1
The questionnaire
1. Age
2. Gender
3. Are you
Partially sighted
4. Type of sight loss
5. Do you use text-to-speech software regularly at home or at work?
Don’t know
6. Have you seen any films with audio description?
Don’t know
7. If you had a choice, which AD voice would you prefer?
A human voice
A synthetic voice
Depends on the film/programme
Don’t know / doesn’t matter
8. Would you accept TTS AD as an interim solution, until a system has been
agreed to have a human voice reading out the AD?
Don’t know
9. Would you accept TTS AD as a permanent solution, as an alternative to a
human voice?
Don’t know
10. Did you like the fact that the AD script was read by a male voice – just as
the lektor’s voice?
Yes, I liked it
No, I’d prefer to have a female voice
Don’t know
11. Was it easy to decide who was speaking?
Yes, it was quite easy
No, it was quite difficult
Sometimes it was difficult
Don’t know
12. Did you like the fact that the AD script included fragments of Almodovar’s
screenplay and that AD was not only a description of what was going
on on the screen, but it also included extra information, such as the
description of the main characters at the beginning of the film?
Don’t know
13. What did you like and what did you not like in the AD you have just seen?
Do you have any comments?
Exploring the Use of Synthetic Speech by Blind and Partially Sighted People
  • H Cryer
  • S Home
Cryer, H. and Home, S. 2008. " Exploring the Use of Synthetic Speech by Blind and Partially Sighted People. " RNIB Centre for Accessible Information, Birmingham: Literature review #2.
User Attitudes towards Synthetic Speech for Talking Books
  • H Cryer
  • S Home
Cryer, H. and Home, S. 2009. " User Attitudes towards Synthetic Speech for Talking Books ". RNIB Centre for Accessible Information, Birmingham: Research report #7.
The Guardian Available at: User Attitudes towards Synthetic Speech for Talking Books
  • P H Bradshaw
  • S Home
Bradshaw, P. 2006. " Volver. " The Guardian, 25 August 2006. Available at: pedroalmodovar Cryer, H. and Home, S. 2009. " User Attitudes towards Synthetic Speech for Talking Books. " RNIB Centre for Accessible Information, Birmingham: Research report #7.