Analysis of preferred speaking rate and pause in spoken easy Japanese
for non-native listeners
Hafiyan Prafiyanto, Takashi Nose, Yuya Chiba and Akinori Ito
Graduate School of Engineering, Tohoku University,
Aramaki Aza-Aoba 6–6–05, Aoba-ku, Sendai, 980–8579 Japan
(e-mail: hafiyan@spcom.ecei.tohoku.ac.jp, tnose@m.tohoku.ac.jp, yuya@spcom.ecei.tohoku.ac.jp, aito@spcom.ecei.tohoku.ac.jp)
(Received 10 April 2017, Accepted for publication 1 August 2017)
Abstract: We investigate the effect of speaking rate and pauses on the perception of spoken Easy
Japanese, which is Japanese language with mostly easy words to facilitate understanding by non-native
speakers. In this research, we used synthetic speech with various speaking rates, pause positions, and
pause lengths to investigate how they correlate with the perception of Easy Japanese for non-native
speakers of Japanese. We found that speech rates of 320 and 360 morae per minute are perceived to be
close to the ideal speaking rate. Inserting pauses in natural places for Japanese native speakers, based
on the dependency relation rule of Japanese, makes sentences easier to listen to for non-native speakers
as well, whereas inserting too many pauses makes the sentences hard to listen to.
Keywords: Language for special purposes, Second language acquisition, Educational technology &
language
PACS number: 43.71.Gv [doi:10.1250/ast.39.92]
1. INTRODUCTION
Nowadays many foreigners live in Japan, and their
number has doubled since 1990 [1],
reaching more than 2 million residents. However, many of
them do not speak Japanese very fluently, and sometimes
they cannot understand public information provided in the
Japanese language. One solution is to provide the informa-
tion using common foreign languages, such as English,
Chinese, and Korean. In times of disaster in particular,
public information can become vital: when a disaster such
as a tsunami or earthquake strikes, an inability to under-
stand information quickly can become a matter of life and
death, and at such times it is impractical to provide such
information in several languages because of the time
needed for translation.
A more practical way to convey information to non-
native speakers of Japanese is to use simplified Japanese
that has a limited vocabulary of easy words and avoids
difficult grammar. ‘‘Easy Japanese,’’ proposed by Sato
et al., is one implementation of this idea [2]. It defines the
vocabulary and grammatical structures for making Japa-
nese sentences easily understood by foreigners. Easy
Japanese was used in written announcements such as
posters that were displayed in places of refuge during the
Great East Japan Earthquake in 2011, helping many
foreigners.
In addition to written announcements, it is often
necessary to convey information in spoken language. Radio
broadcasts are an important way of announcing public
information during a disaster, because a battery-powered
radio receiver can still be used during a blackout.
Announcements on the radio are usually spoken by a human
announcer; in addition, the use of synthetic speech is also
being considered [3,4]. The main
advantage of using synthetic speech instead of a human
announcer is a more streamlined process from composing
the information to broadcasting it. Faster compilation and
broadcasting of information is very helpful in times of
disasters, where information needs to be updated frequent-
ly. Therefore, the use of speech synthesis for Easy Japanese
broadcasts is worth considering. For announcements, the
temporal properties of the spoken utterances can affect
the perception and understanding of the announcement.
To ensure that spoken announcements can be understood
by the listeners, it is necessary to investigate the effect of
those properties on the perception of Easy Japanese
announcements.
This research examined prosodic parameters that might
affect the perception and intelligibility of spoken an-
nouncements in Easy Japanese for non-native listeners. In
this paper, we investigate the effects of speaking rate, pause
position, and pause length, aiming to identify an optimum
value for each. Assessments of listeners' understanding of and
preference for the perceived speech were used to judge
optimality.
2. BACKGROUND
2.1. Concept and Design of Easy Japanese
Easy Japanese was proposed by Sato et al. in 1999
following the Hanshin-Awaji Great Earthquake in 1995
[2]. It is a similar idea to Basic English [5]. Easy Japanese
is designed to be easier to be understood by foreigners
living in Japan. The Easy Japanese Guideline published
by the Sociolinguistic Laboratory of Hirosaki University
provides some rules for composing information in Easy
Japanese [6], such as: (a) The vocabulary should be
restricted to the words that appear in the N4 Level of the
Japanese Language Proficiency Test (JLPT) or easier, (b)
Terms related to the main event, e.g. ‘earthquake,’ should
be used even if they are not included in (a), (c) Sentences
written in Easy Japanese should be grammatically simple
sentences (not compound sentences), (d) Sentences should
be short (around 24 morae on average), and (e) Expressions
that suggest possibility or supposition (e.g., ‘might be’ or
‘it could happen’) should be avoided.
An example of a normal Japanese sentence and its
translation to Easy Japanese is shown in Fig. 1. Consid-
ering the sentence in normal Japanese, the meaning of the
word ‘teiden’ (blackout) can be easily understood by a
native speaker, but it might be a difficult word for a non-
native speaker. The structure of this sentence is also
difficult; to make it easier to understand, the structure is
changed while keeping the intended meaning, using easier
words like ‘denki’ (electricity). Also, unnecessary infor-
mation is deleted to make the sentence shorter.
2.2. Support Systems Developed for Easy Japanese
One problem with converting normal Japanese senten-
ces to Easy Japanese sentences is that it is hard for average
native Japanese speakers to judge the easiness of words for
non-native speakers, so it is difficult to decide which words
need to be converted. To help native speakers write
sentences in Easy Japanese, Nagano and Ito [7] developed
the software YANSIS (YAsashii Nihongo SIen System in
Japanese, Easy Japanese Support System). YANSIS can
point out the parts of the input sentence that do not follow
the rules of Easy Japanese, such as words and phrases that
might be difficult for non-native speakers. YANSIS also
has a function to determine the easiness of a sentence [8].
This function assigns a score of easiness to each input
sentence based on sentence structure, difficulty of words,
use of foreign words, and use of symbol characters.
2.3. Effect of Prosodic Properties on the Perception of
Spoken Language by Non-native Listeners
In past studies, most investigations on the understand-
ing of Japanese language by non-native listeners were done
at the segmental level like syllable, mora, and phoneme.
For example, there are studies regarding non-native
Japanese listeners’ ability to discern difficult phonemes
like affricate consonants [9] or special morae, including
geminates [10] and elongated vowels [11]. These studies
provide important clues on making utterances easier to
understand. On the other hand, properties at the supra-
segmental level, i.e. prosody, also affect the comprehen-
sion of a sentence. Examples of such prosodic properties
include speaking rate, rhythm [12], pause insertion,
fundamental frequency, and intensity [13]. Studies on the
effect of prosodic properties on the intelligibility of spoken
sentences are still mostly done on native listeners of
Japanese, particularly those with impaired hearing due to
aging [14]. We can see that the effect of these prosodic
factors on the perception of spoken Easy Japanese by non-
native listeners has still not been sufficiently investigated.
The effect of prosodic properties on the intelligibility
of sentences for non-native speakers has been studied for
many other languages. Particularly, speaking rate is a
parameter widely believed to be related to intelligibility
[15]. For English, a fast speaking rate has been shown to
hinder understanding for non-native speakers [16]. In
contrast, the benefit of a slow speaking rate on under-
standing is disputed [17], and appears only in limited
circumstances. A speaking rate which is too slow can even
hinder understanding, possibly because it disturbs the
listener’s concentration [18]. Besides speaking rate, pause
is another important prosodic property that has been
researched in some languages. In English and many other
languages, pauses in appropriate positions can help under-
standing [19]. However, too many pauses, or pauses that
are too long, can also hinder understanding for non-native
speakers [17].
Prosodic properties of utterances are, to a certain
degree, language dependent. An analysis of several
Fig. 1 Example of converting from normal Japanese to
Easy Japanese.
languages shows that Japanese is a comparatively fast
language, with around 7.84 syllables per second in every-
day conversation as opposed to 5.22–7.18 syllables per
second for many other languages [20]. For non-native
Japanese listeners, this may mean that slower Japanese
language utterances can be helpful, even more helpful than
for other languages. Therefore, the effect of speaking rate
on the understanding of Japanese language by non-native
listeners needs to be investigated. Furthermore, the
appropriate locations of pauses are greatly affected by the
relations between the words decided by grammatical rules.
Because grammatical rules are language dependent, it is
necessary to investigate the effect of pauses in Japanese
language as well.
3. SPOKEN EASY JAPANESE FOR NON-
NATIVE LISTENERS
The intelligibility of a written announcement depends
mostly on the difficulty of its words and grammatical
rules. Using Easy Japanese can help keep these
difficulties to a minimum. When a script written in Easy
Japanese is orally announced, additional factors, such as
the speaking rate and pauses, will also influence its
intelligibility. To deliver spoken announcements in Easy
Japanese that can be easily understood, we need to
investigate how these factors affect the perception of
spoken Easy Japanese sentences.
In this study, we use synthetic speech to investigate
preferable prosodic properties. As mentioned in the
introduction, speech synthesis enables the information to
be delivered faster than relying on human announcers.
Also, speech synthesis can be used in places such as small
schools that have a network of loudspeakers but do not
have staff trained to deliver announcements professionally.
Table 1 summarizes the measures we use in this
research. We measure the understanding of a sentence using a
subjective measure and an objective measure. We use
the term ''comprehensibility'' for the subjective measure
and ''intelligibility'' for the objective measure. Both are
common ways to measure understanding [21]. To assess
the comprehensibility, the listener assesses his/her own
understanding. Meanwhile, dictation is used to assess the
intelligibility. Some authors, including Kachru and Smith
[22], argue that the ability to write down an utterance shows
some capability to recognize words or other sentence-level
elements of an utterance, and can be used to measure
intelligibility even though the listener may not understand
the whole meaning. We expect these measures to be higher
when a sentence is spoken slowly (slow speaking rate, more
pauses, or longer pauses inserted). However, a sentence that
is too slow might not improve the intelligibility and
comprehensibility much compared to a slightly slow
sentence, and might even have negative effects. Therefore,
we also measure listeners' preference for the speed of the
sentence using the ''speed adequacy'' and the ''listenability.''
Speed adequacy is measured by asking the listener how close
the speed felt to the speed he/she feels most appropriate [23],
while listenability is measured by asking the listener how
easy the sentence is to listen to [24]. The listenability is often used as an index of
speech quality of synthesized or enhanced speech [25,26].
We conducted three experiments to investigate the
optimal speaking rate, pause position, and pause length.
Finally, we also conducted an experiment to investigate
whether applying all of those optimal conditions would
make sentences sound better, by comparing it with the
sound at the standard speaking rate. In total, 67 subjects
participated. The participants were international students
from 8 countries; most (62 participants) came from Asian
countries. 23 participants had taken the Japanese Language
Proficiency Test, while most of those who had never taken
the test reported that they did not use much Japanese in
their daily life.
4. INVESTIGATION OF APPROPRIATE
PROSODIC PARAMETERS
4.1. Speech Synthesis Based on Hidden Markov
Models (HMMs)
We used HTS as a speech synthesizer, which is a
widely-used parametric speech synthesis system based on
hidden Markov models (HMMs) [27]. The speech synthe-
sized using this method can achieve an intelligibility
comparable to natural human speech [28]. Also, this
method lets us control the duration of each phoneme, and
we use this to control the speaking rate.
Pauses are usually considered to be a prosodic feature,
but can be modelled as phonemes in HMM-based speech
synthesis. The advantage of treating a pause as a phoneme
is the ability to insert pauses by simply adding the pause
phoneme. The length of the pause can also be set by
explicitly defining the duration of the pause phoneme.
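As a minimal illustration of this idea (our own sketch, not the actual HTS label format or code), a pause can be represented as an ordinary entry in the phoneme sequence whose duration is either left to the duration model or fixed explicitly:

```python
# Illustrative sketch only: treating a pause as a "phoneme" entry whose
# duration can be set explicitly.  Not the actual HTS label format.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Phone:
    symbol: str                            # phoneme symbol, e.g. "a", "k", or "pau"
    duration_ms: Optional[float] = None    # None = let the duration model decide

def insert_pause(phones: List[Phone], index: int, length_ms: float) -> List[Phone]:
    """Insert a pause 'phoneme' with an explicit duration before position `index`."""
    return phones[:index] + [Phone("pau", length_ms)] + phones[index:]

# Example: /nigeru/ followed by a 500 ms pause
phones = [Phone(p) for p in ["n", "i", "g", "e", "r", "u"]]
phones = insert_pause(phones, len(phones), 500.0)
print([(p.symbol, p.duration_ms) for p in phones])
```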
4.2. Speaking Rate Control
To find an appropriate speaking rate, we synthesize
Table 1 Measures used in this research.

Measure             Description
Comprehensibility   Subjective measure of understanding based on self-assessment
Intelligibility     Objective measure of understanding based on dictation
Speed adequacy      Preference of speed based on how close to the most appropriate speed
Listenability       Ease of listening
speech with various speaking rates. The speaking rates are
controlled by uniformly and linearly converting phoneme
durations in the parameter generation process. A survey of
broadcasted programs shows that the average speaking rate
on Japanese programs ranges from 450 to 570 morae per
minute [29]. We assume that an easier speaking rate for
understanding by non-native speakers would be slower, so
in this research we tested the effect of five speaking rates:
240, 280, 320, 360, and 400 morae per minute.
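A minimal sketch of this kind of uniform duration scaling is shown below (our own illustration, not the actual HTS parameter-generation code; the duration values and mora count are made up):

```python
# Sketch: uniformly rescale phoneme durations so the utterance is spoken
# at a target rate given in morae per minute.
def scale_durations(durations_ms, n_morae, target_rate_mpm):
    """Linearly rescale phoneme durations (ms) to reach `target_rate_mpm`
    morae per minute, where `n_morae` is the mora count of the text."""
    total_ms = sum(durations_ms)
    target_total_ms = n_morae / target_rate_mpm * 60_000.0  # desired utterance length
    factor = target_total_ms / total_ms
    return [d * factor for d in durations_ms]

# Example: a 24-mora sentence rescaled to 360 morae per minute
# (the per-phoneme durations below are hypothetical).
durations = [80.0] * 43
rescaled = scale_durations(durations, n_morae=24, target_rate_mpm=360)
print(round(sum(durations) / 1000, 2), "s ->", round(sum(rescaled) / 1000, 2), "s")
```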
4.3. Pause Insertion
4.3.1. Pause position
In speech communication, it is known that pause
control is very important and pauses inserted in appropriate
positions can help a spoken sentence sound more natural
and intelligible [19]. In this study, we examined three types
of pause insertion position as follows:
(a) None: No pause is inserted.
(b) Dependency: Pauses are inserted according to the
rule described below, illustrated in Fig. 2(a).
(c) Phrase: Pauses are inserted between all phrases,
illustrated in Fig. 2(b).
A Japanese sentence consists of one or more phrases
[30], which are important for determining the positions of
pauses. A typical Japanese phrase (bunsetsu) contains one
content word and zero or more function words such as
particles, suffixes or auxiliary verbs. In Japanese, a pair of
phrases can have a dependency relation, with one phrase
being a head and another being a dependent. Japanese is
a head-final language, which means that the head always
comes after the dependent, but not always immediately
after it. The distance between the head and the dependent
in the unit of phrase is called the dependency relation
distance. If this distance is more than one, which means
there are one or more phrases sandwiched between the
head and the dependent, it is likely that a pause is in-
serted immediately after the dependent in natural Japanese
speech [31].
Even though the dependency-based pause insertion
points are natural for a native speaker of Japanese, more
pauses might help non-native speakers of Japanese to
understand the sentence better. The Easy Japanese Guide-
line suggests inserting pauses between each phrase boun-
dary [6].
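The following sketch illustrates the three insertion rules compared in this study, assuming that phrase boundaries and dependency heads are already given (in practice they would come from a Japanese morphological analyzer and dependency parser, which are not shown; the dependency indices in the example are our own assumption):

```python
# Hedged sketch of the pause-position rules: "none", "phrase", and "dependency".
from dataclasses import dataclass
from typing import List

@dataclass
class Bunsetsu:
    text: str   # the phrase (content word + function words)
    head: int   # index of the phrase this one depends on (-1 for the last phrase)

def pause_positions(phrases: List[Bunsetsu], rule: str) -> List[int]:
    """Return indices i such that a pause is inserted after phrases[i]."""
    if rule == "none":
        return []
    if rule == "phrase":                       # pause at every phrase boundary
        return list(range(len(phrases) - 1))
    if rule == "dependency":                   # pause after a dependent whose head
        return [i for i, p in enumerate(phrases[:-1])   # is not the next phrase
                if p.head - i > 1]
    raise ValueError(rule)

# Example (sentence 2 in Table 2; dependency indices assumed for illustration):
sent = [Bunsetsu("Takusan no", 1), Bunsetsu("ame ga", 2), Bunsetsu("futta", 3),
        Bunsetsu("toki wa", 6), Bunsetsu("takai", 5), Bunsetsu("tokoro e", 6),
        Bunsetsu("nigete kudasai", -1)]
print(pause_positions(sent, "dependency"))   # -> [3]  (after "toki wa")
print(pause_positions(sent, "phrase"))       # -> [0, 1, 2, 3, 4, 5]
```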
4.3.2. Pause length
For a silent region within speech to be recognized as
a pause, the length must be at least around 200 ms [19].
However, the best pause length for non-native speakers is
not known, and it might be longer. In this research, we
examined pause lengths between 200 and 800 ms.
5. EXPERIMENTS
5.1. Experimental Procedure
In our research, we conducted three listening tests. The
first experiment examined the effect of speaking rate, the
second examined the effect of the positions of pauses
within the sentence, and the third examined the effect of
pause length.
Each experiment consisted of two parts, the subjective
evaluation part and the dictation part. In the subjective
evaluation part, the subjects of the experiment listened to
fifteen Easy Japanese sentences with either the speaking
rate, pause insertion rule or pause length controlled, and
then rated the listenability score and the comprehensibility
score of each sentence. The listenability score was rated by
asking how easy the sentence was to listen to, with the score
ranging from 1 (very hard) to 5 (very easy). The
comprehensibility score was rated by asking how much of
the meaning of the sentence was understood, with the score
ranging from 1 (did not understand at all) to 5 (completely
understood). Finally, the subjects also rated how fast the
sentence felt on a scale from 1 (very slow) through 3 (just
right) to 5 (very fast). Using the rated speed, we defined a
speed adequacy score, which showed how close the speed
felt compared to the most appropriate speed. Because we
defined option 3 as ‘‘just right’’ for the perceived speed
question, the perceived speed of 3 is defined as the most
appropriate speed, with the speed adequacy of 3. The
perceived speed of 2 and 4 have the speed adequacy of
2, while the perceived speed of 1 and 5 have the speed
adequacy of 1.
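The mapping from the perceived-speed rating to the speed adequacy score can be written compactly as follows (a small sketch of the scoring described above):

```python
# Perceived speed: 1 = very slow ... 3 = just right ... 5 = very fast.
# Speed adequacy: 3 = most appropriate, down to 1 for the extremes.
def speed_adequacy(perceived_speed: int) -> int:
    """Map a perceived-speed rating of 1-5 to a speed adequacy of 1-3."""
    return 3 - abs(perceived_speed - 3)

assert [speed_adequacy(s) for s in (1, 2, 3, 4, 5)] == [1, 2, 3, 2, 1]
```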
We objectively measure intelligibility in the dictation
part. In this part, the subjects listened to eight Easy
Japanese sentences with various speaking rate and pause
conditions, then typed each sentence. From the subjects’
answers, we defined a dictation score for each sentence.
The dictation score was calculated using the Levenshtein
distance between the spoken sentence and the subject’s
answer, inverted and linearly normalized between 0
(maximum distance, lowest score) and 1 (zero distance,
perfect score).
Let sans and sref be the subject’s dictation and the
reference sentence, respectively, written in hiragana. Let
(a)
(b)
Fig. 2 Pause insertion rule explanation for (a) depend-
ency and (b) phrases.
H. PRAFIYANTO et al.: PREFERRED SPEAKING RATE AND PAUSE IN EASY JAPANESE
95
nans and nref be the numbers of characters of sans and sref ,
respectively. Then the dictation score Sdict was calculated
as follows:
Sdict ¼
maxðnans;nref ÞdLðnans ;sref Þ
maxðnans;nref Þ
Here, dLis the Levenshtein distance.
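A small Python sketch of this score is given below; the paper does not specify a particular Levenshtein implementation, so a textbook dynamic-programming version is used, and the hiragana strings in the example are only illustrative:

```python
# Dictation score: normalized Levenshtein distance between the subject's
# answer and the reference sentence, both written in hiragana.
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dictation_score(s_ans: str, s_ref: str) -> float:
    """1.0 for a perfect transcription, 0.0 for the maximum possible distance."""
    n = max(len(s_ans), len(s_ref))
    return (n - levenshtein(s_ans, s_ref)) / n if n else 1.0

print(dictation_score("にげるまえに", "にげるまえに"))   # 1.0
print(dictation_score("にげるまに", "にげるまえに"))     # 5/6, about 0.83
```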
Table 2 shows the sentences used in this experiment.
Sentences 1 to 15 were used in the subjective evaluation
part of the experiments, while the dictation part of the
experiments used sentences 1 to 8 only. The slashes (/)
show the positions of pauses inserted for the pause position
experiment; the single slashes indicate the positions
inserted according to the phrase rule, and the double
slashes indicate the positions inserted according to both the
dependency rule and the phrase rule.
After conducting the experiments regarding speaking
rate, pause position, and pause length, we tried to
determine the best condition for each factor. We then
conducted an experiment to investigate whether applying
all of those conditions would make sentences sound better,
by comparing it with the sound at the standard speaking
rate. Participants were asked to listen to sentences with
standard prosodic parameters and sentences with tuned
prosodic parameters, then rate which ones were more
subjectively listenable, intelligible, and natural.
5.2. Result for Speaking Rate
In the experiment regarding speaking rate, the subjects
listened to Easy Japanese sentences with five speaking rates
(240, 280, 320, 360 and 400 morae per minute). The
sentences contained no pauses. The subjects were 21
international students at a Japanese university who had
lived in Japan for 4 to 70 months (average: 2 years). 19
participants came from Asian countries (China and
Indonesia), while 2 participants came from Latin America.
7 participants had JLPT qualifications.
Figure 3(a) shows the relation between the speaking
rate and the average speed adequacy. The graph shows that
the speaking rates of 320 and 360 morae per minute have
the highest speed adequacies. Using one-way ANOVA, the
speaking rate was shown to be a statistically significant
(p < 0.0001) factor influencing the speed adequacy. Using
Tukey's test, we found that the average speed adequacies of
both 320 and 360 morae per minute were significantly
higher than those of 240 morae per minute (p < 0.0001) and 280
morae per minute (p < 0.001).
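For readers who want to reproduce this kind of analysis, the sketch below shows a one-way ANOVA followed by Tukey's test using standard SciPy/statsmodels routines; the rating data are placeholders, not the actual experimental scores:

```python
# Hedged sketch of the statistical analysis (one-way ANOVA + Tukey's test).
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rates = [240, 280, 320, 360, 400]
# Speed-adequacy ratings per speaking rate (hypothetical values: 21 subjects x 15 sentences).
scores = {r: np.random.default_rng(r).integers(1, 4, size=21 * 15) for r in rates}

f_stat, p_value = f_oneway(*(scores[r] for r in rates))
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

groups = np.concatenate([scores[r] for r in rates])
labels = np.concatenate([[r] * len(scores[r]) for r in rates])
print(pairwise_tukeyhsd(groups, labels, alpha=0.05))
```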
Figures 3(b)–3(d) show the relation between the speak-
ing rate and, respectively, the average listenability, the
comprehensibility, and the dictation score. Using ANOVA,
the differences between each rate for those scores were
not statistically significant at the 5% significance level,
suggesting that the scores were hardly affected by the
speaking rate.
Based on the results of this experiment, we conclude
that the speaking rates of 320 and 360 morae per minute are
significantly better. Furthermore, because the speaking rate
of 360 morae per minute can deliver information faster, it
is preferable to 320 morae per minute for efficiency of
information communication. Therefore, for the next ex-
Table 2 The sentences used in the experiments and their English translations.

 1. Nigeru / mae ni // hi ga / kieteiru ka // mō ichido / mite kudasai
    (Please check / once again // if the fire / has been extinguished // before / you take refuge)
 2. Takusan no / ame ga / futta / toki wa // takai / tokoro e / nigete kudasai
    (When / heavy / rain / occurs / go / to a higher / ground)
 3. Kyō no / yoru ni // kōen de // nihon ryōri no / pātī ga / arimasu
    (There will be / a Japanese food / party // in the park // at night / today)
 4. Sendai dewa // gozen / jūji kara // basu ya / densya ga / ugokimasu
    (In Sendai // buses / and trains / run / from 10 / in the morning)
 5. Maishū no / suiyōbi wa // daigaku de // nihongo no / jugyō ga / arimasu
    (Every week / on Wednesday // a Japanese / language class / is held / in the university)
 6. Kusuri ga / hitsuyōna / hito wa // chikaku ni / aru / byōin e / itte kudasai
    (Please go / to the hospital / located / nearby // if / you need / medicines)
 7. Sendai kara // Tōkyō ni / iku / toki wa // densha o / tsukau to / benri desu
    (From Sendai // it is / convenient / to go / to Tokyo / using / train)
 8. Moyasu / koto ga / dekiru / gomi wa // kinyōbi ni / sutete kudasai
    (Please put out / trash / that / can be / burnt // on Friday)
 9. Kuruma ya / jitensha o // tsukawanaide // aruite / nigete kudasai
    (Please take refuge / on foot // and do not use / cars / or bicycles)
10. Tōhoku daigaku ni / ikitai / hito wa // goban no / basu ni notte kudasai
    (Those who / want to / go to / Tohoku University // should take / bus / number five)
11. Sūpāmāketto dewa // mizu ya / tabemono o / kau / koto ga / dekimasu
    (You / can / buy / water / and food // at the supermarket)
12. Sendai kūkō de wa // jūichigatsu / nijūgonichi made // zenbu no / hikōki ga / tobimasen
    (In Sendai airport // none of / the planes / will fly // until November / 25th)
13. Abunai to / omotta / toki wa // chikaku ni / iru / hito o / yonde kudasai
    (Please call / other / people / nearby // when you / think / it is dangerous)
14. Kyōshitsu no / sōji ga / owattara // sensei ni / tsutaete kudasai
    (After you / clean / the class // please tell / the teacher)
15. Tōkyō dewa // kyō no / hiru kara / yoru made // ame ga / takusan / furimasu
    (In Tokyo // it will rain / heavily / today // from noon / until night)
periments regarding pauses, we determined the effect of
pause insertion in sentences spoken at the rate of 360
morae per minute.
5.3. Result for Pause Position
In the experiment regarding pause position, the subjects
listened to Easy Japanese sentences with the three pause
position rules (none, dependency, phrase). The pause
length was fixed to 500 ms. The subjects were 18 interna-
tional students at a Japanese university who were not a
subset of the participants in the previous experiments,
although they too had lived in Japan for 4 to 70 months
(average: 2 years). 16 participants came from Asian
countries (China, Indonesia, Korea), while 3 participants
came from Europe. 6 participants had JLPT qualifications.
Figure 4(a) shows the average speed adequacy for each
pause position rule. The graph shows that the pause
position based on the dependency relation rule had the
highest speed adequacy. Using one-way ANOVA, the
pause position rule was shown to be a statistically
significant (p < 0.0001) factor influencing the speed
adequacy. Using Tukey's test, we found that the pause
position based on the dependency relation rule had the
significantly highest speed adequacy (p < 0.05), while
inserting pauses between every phrase had the lowest
adequacy (p < 0.001).
Figure 4(b) shows the average listenability for each
pause position rule. The graph shows that the pause
position based on the dependency relation rule had the
highest listenability. Using one-way ANOVA, the pause
position rule was shown to be a statistically significant
(p < 0.001) factor influencing the listenability. Using
Tukey's test, we found that the pause position based on
the dependency relation rule had the significantly highest
listenability (p < 0.05). This means that pause insertion
is important for listenability, but inserting too many pauses
makes the sentence harder to listen to.
Figures 4(c) and 4(d) show the average comprehensi-
bility and dictation score for each pause position rule.
Using ANOVA, the difference between each rule for those
scores was not statistically significant at the 5% signifi-
cance level. This suggests that the comprehensibility and
dictation scores were hardly affected by the pause position.
Based on the results of this experiment, we conclude
that the pause position based on the dependency rule has
significantly higher listenability. Therefore, in the next
Fig. 4 Average scores for each pause position rule: (a) speed adequacy, (b) listenability, (c) comprehensibility, and
(d) dictation score (significance levels: p < 0.05, p < 0.001, p < 0.0001).
Fig. 3 Average scores for each speaking rate: (a) speed adequacy, (b) listenability, (c) comprehensibility, and (d) dictation
score (significance levels: p < 0.001, p < 0.0001).
experiment, we determined the effect of pause length by
inserting pauses at the positions determined by this rule.
5.4. Result for Pause Length
In the experiment regarding pause length, the subjects
listened to Easy Japanese sentences with five conditions
of pause length (200, 350, 500, 650, and 800 ms). The
subjects were 17 international students at a Japanese
university who had lived in Japan for 1 to 77 months
(average: 2 years). All participants came from Asian
countries (China, Indonesia, Korea). 7 participants had
JLPT qualifications.
Figure 5(a) shows the relation between the pause
length and the average speed adequacy. The graph shows
that the pause lengths of 200–500 ms had higher speed
adequacy than those of 650 and 800 ms. Using one-way
ANOVA, pause length was shown to be a statistically
significant (p < 0.05) factor influencing the speed adequa-
cy. However, using Tukey's test, we found no significant
difference at the 5% significance level between the average
speed adequacies of any pair of pause lengths. This suggests that while
pause length influenced how fast a sentence sounded, we
cannot say with confidence which pause length was better.
Figures 5(b)–5(d) show the relation between the pause
length and, respectively, the average listenability, the
comprehensibility, and the dictation score. Using ANOVA,
the differences between each pause length for those scores
were not statistically significant at the 5% significance
level. This suggests that those scores were hardly affected
by the pause length.
Based on the results of this experiment, we conclude
that pause length has no significant effect on listenability or
comprehensibility. It can be argued that the pause length of
200 ms is the best option, because it allows information to
be transferred faster.
5.5. Comparison of Standard and Tuned Speech
In the experiment comparing standard and tuned
speech, the subjects listened to Easy Japanese sentences
with standard parameters of the speech synthesizer (speak-
ing rate of 417 morae per minute, no pauses) and tuned
parameters (speaking rate of 360 morae per minute, pause
position based on the dependency relation rule, pause
length of 200 ms). The subjects were 10 international
students at a Japanese university who had lived in Japan for
2 years on average. All came from Asian countries (China
and Indonesia). 3 participants had JLPT qualifications.
Figure 6 shows the comparison between the standard
speech and the tuned speech. The tuned speech was found
subjectively to be more listenable and intelligible, but less
natural than the standard speech. Using the population
proportion test, the differences were found to be significant
(p < 0.05). The speaking rate of 360 morae per minute is
rather slow and unlike typical broadcast speech; this
might explain the decrease in naturalness.
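A sketch of such a proportion test is shown below, assuming the preference judgments are tested against chance (0.5); the counts are placeholders, not the reported results:

```python
# Hedged sketch of a population proportion test: is the proportion of
# responses preferring the tuned speech different from 0.5?
from statsmodels.stats.proportion import proportions_ztest

n_preferring_tuned = 80   # hypothetical number of "tuned is better" responses
n_responses = 100         # hypothetical total number of comparisons

z_stat, p_value = proportions_ztest(n_preferring_tuned, n_responses, value=0.5)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```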
6. DISCUSSION
We mentioned in the introduction that one important
reason for this research is to make spoken announcements
such as radio broadcasts in Easy Japanese easy to listen
and understand for non-native speakers. In this study, we
measured understanding using comprehensibility and in-
telligibilty. We measured preference using speed adequacy
and listenability. On the experiments measuring the effect
of various prosodic conditions on the understanding and
preference, we found that generally, the understanding was
not affected much, but the preference was affected: a
speaking rate of 360 morae per minute is perceived to be
close to the ideal speed by non-native speakers and has
high listenability when pauses are inserted in appropriate
Fig. 5 Average scores for each pause length: (a) speed adequacy, (b) listenability, (c) comprehensibility, and (d) dictation
score.
Fig. 6 Comparison between standard and tuned speech.
positions based on the dependency relation rule. In this
section, we discuss the validity of these measure-
ments and other points that need to be considered when
applying our findings.
The methods used to measure the comprehensibility
(self-assessment) and the intelligibility (dictation) in this
study have some caveats. Self-assessment cannot catch
genuine misunderstanding on the listener's part, so it might
overestimate the understanding. Dictation requires memo-
rization skill and writing skill in addition to understanding,
so it might underestimate the understanding. However, we
found that the correlations between the comprehensibility
and the intelligibility ranged from moderate to strong
(speaking rate: r = 0.52, pause position: r = 0.81, pause
length: r = 0.78), showing that they were reliable to a
certain degree. Although we expected that the understand-
ing would be higher for slower conditions based on
observations in daily life, the fact that no differences were
found was not unprecedented, and agrees with some
previous studies [17,18].
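Such correlations can be computed from the paired per-condition scores, for example as follows (a sketch with placeholder values, not the actual data):

```python
# Sketch: Pearson correlation between comprehensibility (self-assessment)
# and intelligibility (dictation score); the paired values are placeholders.
from scipy.stats import pearsonr

comprehensibility = [3.8, 3.9, 4.1, 4.0, 3.7]    # hypothetical per-condition means
intelligibility   = [0.71, 0.74, 0.78, 0.76, 0.70]

r, p = pearsonr(comprehensibility, intelligibility)
print(f"r = {r:.2f} (p = {p:.3f})")
```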
We found that the speed adequacy was the only
measure significantly affected by all of the conditions.
However, the differences in the perceived speed did not translate
into differences in the intelligibility and the comprehen-
sibility. This suggests that even for conditions thought to be
faster than ideal, the subjects could still comprehend parts of
the sentence or get the dictation of easier words right.
The speaking rate of 360 morae per minute was
perceived to be close to the ideal speed by non-native
speakers. This rate is substantially slower than the average
speaking rate usually found in programs for native speak-
ers, which ranges from 450 to 570 morae per minute [29].
An important point about radio broadcasts is that they are
listened to by both native speakers and non-native speak-
ers, who have different preferences for speaking style.
Before applying the results of this research, it is important
to consider the best way of delivering spoken announcements from
the viewpoint of universal design.
7. CONCLUSIONS
We investigated the effects of prosodic properties
(speaking rate, pause position and pause length) on the
perception of speech. We used synthetic speech with
various conditions to investigate how they correlate with
the intelligibility and listenability of spoken Easy Japanese.
We found a speaking rate of 360 morae per minute
with 200 ms pauses to be close to the ideal speaking speed.
It is also more appropriate to insert pauses at positions that are
natural for native speakers, based on the depend-
ency relation rule of the Japanese language, as opposed to
inserting pauses between every phrase.
these conditions was found to be more listenable and
intelligible than speech at the standard speaking rate.
ACKNOWLEDGEMENT
Part of this work was supported by JSPS KAKENHI
Grant-in-Aid for Scientific Research (B) Grant Numbers
JP26284069 and JP16K13253. We also thank
Prof. Kazuyuki Sato of Hirosaki University for useful
discussions.
REFERENCES
[1] Statistics Bureau, ‘Japan Statistical Yearbook,’’ http://
www.stat.go.jp/english/data/nenkan/index.htm (2017).
[2] Y. Miyazaki, ‘‘Yasashii nihongo (Easy Japanese) on commun-
ity media: Focusing on radio broadcasting,’Kwansei Gakuin
Policy Stud. Rev.,8, 1–14 (2007).
[3] L. F. Lamel, J. L. Gauvain, B. Prouts, C. Bouhier and R.
Boesch, ‘‘Generation and synthesis of broadcast messages,’
Proc. J. ESCA-NATO Workshop and Applications of Speech
Technology, pp. 207–210 (1993).
[4] Z. Hanzlíček, J. Matoušek and D. Tihelka, ''Towards automatic
audio track generation for Czech TV broadcasting: Initial
experiments with subtitles-to-speech synthesis,'' Proc. 9th Int.
Conf. Signal Processing, pp. 2721–2724 (2008).
[5] C. K. Ogden, Basic English as an International Second
Language (Harcourt, Brace & World, San Diego, 1968).
[6] Hirosaki University’s Sociolinguistics Laboratory, ‘Easy
Japanese Guideline,’http://human.cc.hirosaki-u.ac.jp/kokugo/
ej-gaidorain.pdf (2014) (in Japanese).
[7] T. Nagano and A. Ito, ‘‘YANSIS: An ‘Easy Japanese’ writing
support system,’Proc. Int. Conf. ICT for Language Learning,
pp. 273–279 (2015).
[8] M. Zhang, A. Ito and K. Sato, ‘‘Automatic assessment of
easiness of Japanese for writing aid of ‘Easy Japanese’,’Proc.
Int. Conf. Audio, Language and Image Processing, pp. 303–
307 (2012).
[9] K. Yamakawa, Y. Chisaki and T. Usagawa, ‘‘Subjective
evaluation of Japanese voiceless affricate spoken by Korean,’
Acoust. Sci. & Tech.,27, 236–238 (2006).
[10] M. S. Han, ‘‘The timing control of geminate and single stop
consonants in Japanese: A challenge for nonnative speakers,’
Phonetica,49, 102–127 (1992).
[11] Y. Hirata, ‘‘Training native English speakers to perceive
Japanese length contrasts in word versus sentence contexts,’
J. Acoust. Soc. Am.,116, 2384–2394 (2004).
[12] V. Dellwo and P. Wagner, ‘‘Relationships between rhythm and
speech rate,’Proc. 15th Int. Congr. Phonetic Sciences,
pp. 471–474 (2003).
[13] M. Ostendorf, I. Shafran and R. Bates, ‘‘Prosody models for
conversational speech recognition,’Proc. 2nd Plenary Meet.
Symp. Prosody and Speech Processing, pp. 147–154 (2003).
[14] Y. Nejime, T. Aritsuka, T. Imamura, T. Ifukube and J. I.
Matsushima, ‘‘A portable digital speech-rate converter for
hearing impairment,’IEEE Trans. Rehabil. Eng.,4, 73–83
(1996).
[15] J. Rubin, ‘‘A review of second language listening comprehen-
sion research,’Mod. Lang. J.,2, 199–221 (1994).
[16] R. Griffiths, ‘‘Speech rate and NNS comprehension: A
preliminary study in time-benefit analysis,’Lang. Learn.,40,
311–336 (1990).
[17] T. Derwing, ‘‘Speech rate is no simple matter,’Stud. Second
Lang. Acquis.,12, 303–313 (1990).
[18] D. E. Berlyne, Conflict, Arousal and Curiosity (McGraw-Hill,
New York, 1960).
[19] B. Zellner, ‘‘Pauses and the temporal structure of speech,’
in Fundamentals of Speech Synthesis and Speech Recognition,
E. Keller, Ed. (John Wiley & Sons, Chichester, 1994) pp. 41–
62.
[20] F. Pellegrino, C. Coupé and E. Marsico, ''A cross-language
perspective on speech information rate,'' Language, 87, 539–
558 (2011).
[21] M. J. Munro and T. M. Derwing, ‘‘Foreign accent, compre-
hensibility, and intelligibility in the speech of second language
learners,’Lang. Learn.,49, Suppl. 1, 285–310 (1999).
[22] Y. Kachru and L. E. Smith, Cultures, Context, and World
Englishes (Routledge, New York, 2008).
[23] T. Derwing and M. J. Munro, ‘‘What speaking rates do non-
native listeners prefer?,’Appl. Linguist.,22, 324–337 (2001).
[24] K. Harwood and F. Cartier, ‘‘On definition of listenability,’
South. Speech J.,18, 20–23 (1952).
[25] S. Pearson, H. Moran, K. Hata and F. Holm, ‘‘Combining
concatenation and formant synthesis for improved intelligibil-
ity and naturalness in text-to-speech systems,’Proc. 2nd
ISCA/IEEE Workshop Speech Synthesis, pp. 69–72 (1994).
[26] K. Tanaka, T. Toda, G. Neubig, S. Sakti and S. Nakamura, ‘‘An
inter-speaker evaluation through simulation of electrolarynx
control based on statistical F0 prediction,’Proc. Annu. Summit
and Conf. Signal and Information Processing (APSIPA),4
pages (2014).
[27] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi and
T. Kitamura, ‘‘Speech parameter generation algorithms for
HMM-based speech synthesis,’Proc. ICASSP 2000, pp. 1315–
1318 (2000).
[28] S. Takaki, K. Sawada, K. Hashimoto, K. Oura and K. Tokuda,
‘Overview of NITECH HMM-based speech synthesis system
for Blizzard Challenge 2013,’Proc. Blizzard Challenge 2013,
6 pages (2013).
[29] A. Nakamura, N. Seiyama, A. Imai, T. Takagi and E.
Miyasaka, ‘‘A new approach to compensate degeneration of
speech intelligibility for elderly listeners-development of a
portable real time speech rate conversion system,’IEEE
Trans. Broadcast.,42, 285–293 (1996).
[30] K. Shudo, T. Narahara and S. Yoshida, ‘‘Morphological aspect
of Japanese language processing,’Proc. 8th Conf. Computa-
tional Linguistics, pp. 1–8 (1980).
[31] K. Takagi and K. Ozeki, ‘‘Pause information for dependency
analysis of read Japanese sentences,’Proc. Eurospeech 2001,
pp. 1041–1044 (2001).
Hafiyan Prafianto was born in Jakarta, Indo-
nesia in 1989. He received the B.E. degree from
Tokyo Institute of Technology, Japan in 2011
and the M.E. degree from Tohoku University,
Sendai, Japan, in 2013. He is currently a Ph.D.
candidate in the Graduate School of Engineer-
ing, Tohoku University.
Takashi Nose received the B.E. degree in
electronic information processing, from Kyoto
Institute of Technology, Kyoto, Japan, in 2001.
He received the Dr.Eng. degree in information
processing from Tokyo Institute of Technology,
Tokyo, Japan, in 2009. He was a Ph.D.
researcher of the 21st Century Center Of
Excellence (COE) program and the Global
COE program in 2006 and 2007, respectively.
He was an intern researcher at ATR spoken language communication
Research Laboratories (ATR-SLC) in 2008. From 2009 to 2013, he
was an assistant professor of the Interdisciplinary Graduate School of
Science and Engineering, Tokyo Institute of Technology, Yokohama,
Japan. He is currently a lecturer of the Graduate School of
Engineering, Tohoku University, Sendai, Japan. He is a member of
IEEE, ISCA, IPSJ, and ASJ. His research interests include speech
synthesis, speech recognition, speech analysis, and spoken dialogue
system.
Yuya Chiba received the B.E., M.E. and
Ph.D. degrees in engineering from Tohoku
University, Miyagi, Japan in 2010, 2012, and
2016. He is currently an Assistant Professor of
the Graduate School of Engineering, Tohoku
University, Japan. His research interests include
spoken dialog system, multi-modal information
processing, and human interface. He received
IEICE ISS Young Researcher’s Award in
Speech Field in 2014. He is a member of ISCA, IEICE, and ASJ.
Akinori Ito was born in Yamagata, Japan in
1963. He received the B.E., M.E. and Ph.D.
degrees from Tohoku University, Sendai, Japan,
in 1984, 1986 and 1992 respectively. He is now
a Professor of Graduate School of Engineering,
Tohoku University. He has engaged in spoken
language processing, music information proc-
essing and multimodal signal processing. He is
a member of the Acoustical Society of Japan,
the Information Processing Society Japan, Human Interface Society
and the IEEE.