IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 25, NO. 6, JUNE 2017 641
Loudness Contour Can Influence Mandarin Tone
Recognition: Vocoder Simulation and
Cochlear Implants
Qinglin Meng, Nengheng Zheng, Member, IEEE, and Xia Li, Member, IEEE
Abstract—Lexical tone recognition with current cochlear
implants (CI) remains unsatisfactory due to significantly
degraded pitch-related acoustic cues, which dominate the
tone recognition by normal-hearing (NH) listeners. Sev-
eral secondary cues (e.g., amplitude contour, duration,
and spectral envelope) that influence tone recognition in
NH listeners and CI users have been studied. This work
proposes a loudness contour manipulation algorithm,
namely Loudness-Tone (L-Tone), to investigate the effects
of loudness contour on Mandarin tone recognition and
the effectiveness of using loudness cue to enhance tone
recognition for CI users. With L-Tone, the intensity of sound
samples is multiplied by gain values determined by instantaneous fundamental frequencies (F0s) and pre-defined gain-F0 mapping functions. Perceptual experiments were conducted with a four-channel noise-band vocoder simulation
in NH listeners and with CI users. The results suggested
that 1) loudness contour is a useful secondary cue for
Mandarin tone recognition, especially when pitch cues are
significantly degraded; 2) L-Tone can be used to improve
Mandarin tone recognition in both simulated and actual
CI-hearing without significant negative effect on vowel and
consonant recognition. L-Tone is a promising algorithm for
incorporation into real-time CI processing and off-line CI
rehabilitation training software.
Index Terms—Cochlear implant, loudness contour, Mandarin tone recognition, pitch.
I. INTRODUCTION
CONTEMPORARY multi-channel cochlear implant (CI)
systems can provide some lexical tone recognition capa-
bility for patients [1], [2], although most clinically available
CI strategies preserve only temporal envelopes [3] of the
Manuscript received December 29, 2015; revised June 1, 2016;
accepted July 12, 2016. Date of publication July 20, 2016; date of current
version June 18, 2017. This work was supported by the China Post-
doctoral Science Foundation (2015M572360), Guangdong Natural Sci-
ence Foundation (2014A030313557), Shenzhen Key Laboratory Project
(CXB201105060068A), and a fund awarded by the China Scholarship
Council (201308440223).
Corresponding author: N. Zheng (e-mail: nhzheng@szu.edu.cn).
Q. Meng is with the Acoustic Lab, School of Physics and Optoelectron-
ics, South China University of Technology (SCUT), Guangzhou 510641,
China and also with the College of Information Engineering, Shenzhen
University, Shenzhen 518060, China.
N. Zheng is with the Shenzhen Key Laboratory of Modern Communi-
cation and Information Processing, College of Information Engineering,
Shenzhen University, Shenzhen 518060, China. He was with the Univer-
sity of New South Wales, NSW 2052, Australia.
X. Li is with the Shenzhen Key Lab of Modern Communication and
Information Processing, College of Information Engineering, Shenzhen
University, Shenzhen 518060, China.
Digital Object Identifier 10.1109/TNSRE.2016.2593489
channel signals and are not customized for tonal languages [4].
However, the performance of lexical tone recognition with CIs
is significantly worse than that of normal hearing [5], [6],
moderately impaired hearing [7], and even vocoder-simulated
CI hearing [8].
The poor tone recognition performance with CIs has been
attributed primarily to the inadequate representation of pitch
cues in the electric stimulation. The frequency resolution of a
CI device is largely determined by the number of implanted electrodes (no more than 24, far fewer than the number of auditory filters in a normal cochlea), which results in poor spectral pitch coding in CIs. However, it was found that the
temporal pitch coding may be partially preserved in the CI
processing and the corresponding coding cue is the periodicity
information in the temporal envelope of the electric stimuli
on each electrode [9]. Periodicity enhancement of the stimuli
and its potential benefits to improve voice pitch perception,
including lexical tone recognition, have been investigated.
For instance, periodicity can be enhanced by increasing the
amplitude modulation depth [by subtracting 0.4 × the compressed slow envelope (<50 Hz) from the compressed fast envelope (<400 Hz)] [10], or by amplitude-modulating the temporal envelopes using saw-tooth or sinusoidal waveforms at the
electrodes [11]–[15]. Besides, some recent studies proposed to substitute a frequency-downshift operation for the classic temporal envelope extraction in CI strategies [16]–[18]. They hypothesized that this substitution could improve the representation of harmonic structure or temporal fine structure, which is useful for
pitch discrimination. Although these methods showed certain
potential improvements on CI pitch perception, the application
of such methods in commercial CI devices remains limited by
1) the capability of electric temporal pitch, 2) the engineering
difficulties in real-time and real-life realization [14], and
3) the risk of spectral information distortion caused by the
enhancement operations [11].
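The modulation-depth enhancement summarized above (subtracting 0.4× the compressed slow envelope from the compressed fast envelope, following [10]) can be sketched as follows. The power-law compression exponent and the filter orders are illustrative assumptions, not taken from [10].

```python
import numpy as np
from scipy.signal import butter, filtfilt

def enhance_modulation_depth(x, fs, beta=0.4, exponent=0.3):
    """Sketch of the modulation-depth enhancement of [10]: subtract
    beta x the compressed slow envelope (<50 Hz) from the compressed
    fast envelope (<400 Hz).  Compression exponent is an assumption."""
    rect = np.abs(x)                        # full-wave rectification
    b_f, a_f = butter(4, 400 / (fs / 2))    # fast-envelope low-pass
    b_s, a_s = butter(4, 50 / (fs / 2))     # slow-envelope low-pass
    fast = filtfilt(b_f, a_f, rect)
    slow = filtfilt(b_s, a_s, rect)
    compress = lambda e: np.maximum(e, 0.0) ** exponent
    return compress(fast) - beta * compress(slow)

fs = 16000
t = np.arange(fs) / fs
# 1 kHz carrier modulated at 200 Hz, a typical voice-pitch rate
x = (1 + 0.5 * np.sin(2 * np.pi * 200 * t)) * np.sin(2 * np.pi * 1000 * t)
env = enhance_modulation_depth(x, fs)
```

The enhanced envelope would then replace the conventional channel envelope before loudness growth mapping.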
In addition to the primary pitch cues, secondary acoustic
cues such as amplitude contour (depending on its correlation
degree with the F0 contour), syllable duration (e.g., Tone 3 is
generally the longest one among the four tones of Mandarin),
and spectral envelope (e.g., for whispered Mandarin speech)
have also been shown to be useful for lexical tone recog-
nition [9], [19]–[21]. In these studies, the contributions of
the secondary cues were revealed using stimuli with sig-
nificantly degraded pitch cues (including fine structures in
both temporal and spectral domains [21]–[23]). These results
implied that some secondary cues might be useful for lexical
tone perception with CIs, in which pitch cues are usually
inadequately transmitted to patients. Luo and Fu [24] proposed
a method, known as Enhanced-Tone (E-Tone), to modify
the temporal amplitude contour to resemble the F0 contour,
and significant benefits for Mandarin tone recognition were
obtained with a four-channel vocoder CI simulation in normal-
hearing listeners.
How could amplitude-contour modification influence
lexical-tone recognition, which in linguists' view is a pitch-related or F0-related task? In this study, we introduce
“loudness contour” as a middle layer to bridge the gap
between the subjective perception of tone and the objective
measurement of amplitude. As we know, the four tones in Mandarin are characterized by F0 contour patterns (within an individual monosyllabic word) [25], and musical melodies are intrinsically dominated by relative pitch, which is also coded by F0 (between sequential notes of a musical instrument) [26]. Recent psychoacoustical studies have
revealed an interaction between pitch and loudness for melody
recognition. For example, McDermott et al. [27] found that our
auditory system has a common feature in the representations
of contours in pitch, loudness, and timbre; Cousineau et al.
and Luo et al. [28], [29] found an interaction between pitch
and loudness contour in CI users. In [29], normal-hearing
listeners showed better performance on pitch discrimination
than on loudness discrimination, whereas CI users showed
comparable performance on both tasks, and co-varying both
cues was suggested to be useful for melody recognition with
CIs. Similar effects of loudness on lexical tone discrimination
were also mentioned in some studies on simulations of the
modulation excitation patterns for lexical tones vocoded with
an eight-band tone-excited vocoder [30], [31]. They showed
that rising and flat lexical tones differ in terms of slow and
fast amplitude modulation cues that are related to loudness
increase at the end of rising lexical tones.
The goal of this study is to further investigate the effects of
loudness contour on Mandarin tone recognition and the poten-
tial of using loudness cue to enhance tone recognition for CI
users. A loudness manipulation algorithm, namely Loudness-
Tone (L-Tone), is proposed to manipulate the instantaneous
loudness of sound using instantaneous pitch values. Instead of
manipulating the temporal envelope as done by E-Tone in [24],
L-Tone directly manipulates channel signals using predefined
F0 -gain functions, such that the intensity change (in dB)
rather than the intensity itself is determined by the instan-
taneous F0. The relation between L-Tone and E-Tone is discussed in detail in Section II-A. We hypothesized that the loudness contour manipulated by L-Tone can drive tone discrimination, and that L-Tone might be able to improve the tone recognition ability of CI users without significant
negative effects on speech intelligibility. To examine such
hypotheses, Mandarin monosyllable recognition experiments
were carried out in both vocoder-simulated normal-hearing
listeners and actual CI listeners. There were three tasks
including tone recognition, initial consonant recognition, and
final vowel recognition. The consonant and vowel recognition
tasks were conducted to monitor the effect of L-Tone on
Fig. 1. Block diagram of the L-Tone algorithm.
speech intelligibility. All monosyllables were processed by
L-Tone. To observe the conflicting and co-varying effects
of loudness contour and pitch contour on tone recognition,
loudness contour of the stimuli was modified in both forward
and reverse directions along with the F0 contour. To observe
the band effects on the effectiveness of L-Tone, the stimuli
were processed by L-Tone with three band conditions, i.e., all
bands being modified, only high-frequency bands (>1250 Hz)
being modified, and only low-frequency bands (<1250 Hz)
being modified. The normal-hearing simulation group (seven
subjects in quiet) was tested with all three band conditions,
whereas the CI group (four subjects in quiet and four subjects
in noise) was tested only with the “all band condition” to
save time.
II. METHODS
A. Loudness Manipulation Algorithm (L-Tone)
With L-Tone, the intensity of each sample in the voiced
portions is multiplied by a gain value determined by the
instantaneous F0. A mapping function is used to generate a
loudness gain contour in a forward or reverse direction along
with the F0 contour. The unvoiced frames remain unchanged.
Fig. 1 illustrates the signal processing procedure of L-Tone.
The original signal x(t) (digitized at a sampling rate of 16 kHz
in this study) is passed through a set of Gammatone filters.
In this study, the filter spacing and the bandwidth of each filter were set to 0.35 and 1.019 equivalent rectangular bandwidths (ERB) [32], respectively. Given the frequency range of 80–7999 Hz for the target speech, there were 84 filters
in total. For the voiced portion of x(t), F0 values are first computed frame-by-frame with a frame length of 16 ms and no inter-frame overlap, and then an F0 contour F0(t) is generated
by interpolation. In this study, a cubic spline interpolation was applied, and the F0 values were computed using the auto-correlation (ac) and cross-correlation (cc) methods in the Praat software [33] and then manually screened from the Praat results. An intensity gain function for the kth Gammatone filter output, Gk(t), is derived from the interpolated F0 contour by
Gk(t) = [(F0(t) − min(F0(t))) / 100 + 1]^αk    (1)
where min(F0(t)) denotes the minimum F0 of the overall signal, and αk denotes the loudness manipulation factor for the kth channel. The output of the
Fig. 2. The mapping function between the F0 difference and the sound
level gain.
kth Gammatone filter is multiplied by Gk(t). That is, its sound level is increased (or decreased, with a negative αk) by

20 log10(Gk(t)) = 20 αk log10[(F0(t) − min(F0(t))) / 100 + 1] dB.    (2)
Finally, all level-modified outputs from the Gammatone filters
are individually root-mean-square (RMS) equalized to retain
the same RMS as the original outputs from the Gammatone fil-
ters and then combined to resynthesize a loudness-manipulated
audio signal.
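The frame-wise F0 estimates are interpolated to a sample-rate contour F0(t) before the gain is computed. A minimal sketch using scipy's CubicSpline; the frame F0 values below are hypothetical, whereas the paper obtains them from Praat (16 ms frames, no overlap) and screens them manually.

```python
import numpy as np
from scipy.interpolate import CubicSpline

frame_len = 0.016                          # 16 ms analysis frames
t_frames = np.arange(8) * frame_len        # frame times (hypothetical)
f0_frames = np.linspace(120.0, 220.0, 8)   # Hz, a Tone-2-like rise

cs = CubicSpline(t_frames, f0_frames)      # cubic spline interpolation

fs = 16000
t = np.arange(0.0, t_frames[-1], 1.0 / fs)
f0_contour = cs(t)                         # sample-rate F0(t) contour
```

The spline passes exactly through the frame-wise estimates, so the contour agrees with the measured F0 at every frame center.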
Equation (2) shows that, for a given monosyllabic signal and a predefined factor αk, the sound level gain (in dB) for the kth channel depends on the difference between the instantaneous F0 and the minimum F0 of the target word. Fig. 2 demonstrates this dependency for different αk (the subscript k is omitted in the figure). We can see that α = 0 represents no adjustment of the sound level, a positive α represents a sound level gain that increases as F0 increases, a negative α represents a sound level gain that decreases as F0 increases, and a larger |α| indicates a larger dynamic range of the sound level adjustment. For example, for a speech sample with an F0 of min(F0(t)) + 100 Hz, the sound level gains are −24.0, −12.0, 0.0, 12.0, and 24.0 dB for αk = −4, −2, 0, 2, and 4, respectively.
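Equations (1) and (2) and the per-channel RMS equalization step can be sketched as follows. The test signal and F0 contour below are illustrative stand-ins, not the paper's stimuli.

```python
import numpy as np

def ltone_gain(f0_contour, alpha_k):
    """Per-sample channel gain, Eq. (1):
    Gk(t) = ((F0(t) - min F0) / 100 + 1) ** alpha_k."""
    d = (np.asarray(f0_contour) - np.min(f0_contour)) / 100.0 + 1.0
    return d ** alpha_k

def apply_ltone(channel, f0_contour, alpha_k):
    """Apply the gain, then RMS-equalize the modified channel back to
    the original channel RMS, as in the resynthesis step above."""
    ch = np.asarray(channel, dtype=float)
    y = ch * ltone_gain(f0_contour, alpha_k)
    return y * np.sqrt(np.mean(ch ** 2) / np.mean(y ** 2))

# Eq. (2): dB gain at F0 = min(F0) + 100 Hz is 20*alpha*log10(2)
db_at_100 = lambda alpha: 20 * alpha * np.log10(2.0)

t = np.linspace(0.0, 0.3, 4800)
f0 = 120.0 + 100.0 * (t / 0.3)             # rising F0, 120 -> 220 Hz
ch = np.sin(2 * np.pi * 219.06 * t)        # stand-in channel output
out = apply_ltone(ch, f0, alpha_k=4)
```

For α = 4, `db_at_100` gives 20·4·log10 2 ≈ 24.08 dB, matching the ≈24 dB value quoted above, and the RMS equalization leaves the channel level unchanged while its envelope shape follows the F0 contour.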
As an example, the loudness manipulation of the tenth filter output (with α10 = −4 and 4) is demonstrated in Fig. 3, in which the Hilbert envelopes of the original and the manipulated signals are given for comparison. The target monosyllable is /bá/ (a rising tone). As illustrated, the envelope of the signal modified with α10 = 4 shows a more rising trend, and that with α10 = −4 shows a falling trend, compared to the original envelope (i.e., the solid black line).
As another example, Fig. 4 demonstrates three electrodograms of a monosyllable /ba/ (a falling-rising tone) processed by L-Tone with α1–84 = −4, 0, and 4, respectively. As demonstrated, in comparison to the electrodogram with α1–84 = 0 (middle, no loudness manipulation), more energy is distributed to the left and right ends in the electrodogram with α1–84 = 4 (bottom, more falling and more rising), which may
Fig. 3. A demonstration of loudness manipulation by L-Tone.
The signal (middle grey line) is the tenth Gammatone filter (center
frequency = 219.06 Hz) output of a Tone 2 token /bá/ (, pull) spoken by
a female.
enhance the identification of Tone 3 in the sense of the loudness contour rather than the pitch contour. In contrast, more energy is distributed toward the center with α1–84 = −4 (top, tending to be rising-falling or flat).
The study of L-Tone was inspired by the E-Tone algorithm
proposed in Luo and Fu [24]. In that study, the authors showed
some advantages for tone recognition with four noise-band
vocoders in normal-hearing listeners by making the amplitude
contour shape more similar to the F0 contour shape. In order to quantify the "similarity" between the two different physical quantities (i.e., pressure and frequency), the F0 contour was
first calibrated to hold the same RMS as the temporal envelope
(i.e., the amplitude contour; calculated by measuring the RMS
of the input signal on a frame-by-frame basis), and then the
temporal envelope was partially or fully substituted by the
calibrated F0 contour through a linear weighting function
(see [24, eq. (1)–(2)]).
Both L-Tone and E-Tone produce signals whose amplitude
contours are changed according to the F0 contour. Never-
theless, there are two major differences between the two
algorithms. First, with L-Tone, the input signal is multiplied by
a gain function determined by the instantaneous F0. This multiplication, compared with the linear weighting used by E-Tone, gives the intensity (or loudness) gain in dB a clear mapping relationship to F0 (as demonstrated in Fig. 2).
With E-Tone, this mapping is ambiguous, as E-Tone pays
more attention to the shape-similarity between the amplitude
contour and the F0 contour than the mapping relationship
between the loudness variation and F0. Second, the attack and
release time information of the signal may be destroyed by
E-Tone, because the sudden change of F0 (between unvoiced
and voiced portions or between voiced and unvoiced portions)
may result in a zero attack or zero release time in the modified
stimuli. L-Tone avoids this problem to some extent, because the intensities at the two ends of a sound rise from or fall to a small value (or even zero), so their multiplication by the gain values still yields small values.
B. Perceptual Testing 1: Vocoder Simulation in
Normal-Hearing Listeners
The following three tasks were tested: Mandarin tone identi-
fication (T), initial consonant recognition (C), and final vowel
Fig. 4. Electrodogram demonstrations of L-Tone. The target speech is a female /ba/. a) α1–84 = −4, representing the reverse condition; b) α1–84 = 0, representing no modification; c) α1–84 = 4, representing the forward condition. This figure was drawn using the Nucleus MATLAB Toolbox 4.20 software.
recognition (V). Tasks C and V were used to preliminarily
evaluate whether L-Tone introduces any negative effects on
speech intelligibility.
1) Subjects:
Seven normal-hearing college students (S1–S6 and S8) participated in this experiment. They are native Mandarin speakers, but they grew up in southern China, where Mandarin is generally spoken less natively than in northern China. They provided written informed consent before the experiment and were paid hourly.
Fig. 5. F0 contour onset values and F0 dynamic ranges of all 64 tokens used for the tone recognition task (F: female; M: male). The dynamic range was calculated as the largest F0 difference within the voiced part of a single spoken word. "max" and "min" denote the maximum and minimum operations, respectively.
2) Materials:
Task T: Mandarin monosyllabic words were
derived from the advanced tone test module of the AngelTest
software (emilyshannonfufoundation.org). Speech data from
these words were collected from four speakers (two males and
two females), each producing four tones for the monosyllables
(initial consonant: /b/; final vowels: /a/, /i/, /o/, and /u/),
yielding a total of 64 tokens (4 speakers × 4 tones × 4 vowels). The onset values and dynamic ranges of the F0
contours for all tokens are illustrated in Fig. 5. We can see that Tone 4 (the falling tone) has the largest dynamic range (187–300 Hz for females; 127–189 Hz for males), and Tone 1 (the high-flat tone) has the smallest (mostly 8–43 Hz). The dynamic ranges of Tone 2 (the rising tone) and Tone 3 (the falling-rising tone) are within 47–125 Hz.
Tasks C and V: Mandarin monosyllabic words were derived
from the basic consonant and vowel test modules (each with
24 stimuli) of AngelTest, including six consonant groups and
six vowel groups. The consonant groups are as follows: 1. /pí/, /lí/, /qí/, /xí/; 2. /gǔ/, /hǔ/, /zhǔ/, /wǔ/; 3. /māo/, /dāo/, /chāo/, /yāo/; 4. /gǒu/, /kǒu/, /shǒu/, /zǒu/; 5. /jì/, /rì/, /cì/, /sì/; and 6. /fù/, /tù/, /nù/, /bù/. The six vowel groups are as follows: 1. /chá/, /chái/, /chán/, /chún/; 2. /shé/, /shí/, /sháo/, /shéng/; 3. /yá/, /yáng/, /yú/, /yíng/; 4. /mò/, /mù/, /mèi/, /miè/; 5. /guī/, /gōu/, /gēn/, /gōng/; and 6. /qiú/, /qué/, /qín/, /qún/.
3) Test Conditions (Algorithm Parameters):
L-Tone was employed with the following three frequency band conditions: all 84 bands modified (A: 80–7999 Hz; α1–84 = −4, −2, 0, 2, and 4), only the No. 43–84 high-frequency bands modified (H: 1250–7999 Hz; α43–84 = −4, −2, 2, and 4; α1–42 = 0), and only the No. 1–42 low-frequency bands modified (L: 80–1250 Hz; α1–42 = −4, −2, 2, and 4; α43–84 = 0), yielding a total of 13 test conditions. For
condition A, all monosyllabic words in tasks T, C, and V
were processed by L-Tone to generate the loudness-modified
sounds, whereas for conditions H and L, only monosyllabic
words in task T were processed in order to save time.
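The 13 conditions can be enumerated directly; α = 0 is identical across band conditions, so it is tested only once, under condition A.

```python
# Band A carries five alpha values; bands H and L carry the four
# nonzero alpha values each, giving 5 + 4 + 4 = 13 test conditions.
conditions = [("A", a) for a in (-4, -2, 0, 2, 4)] \
           + [("H", a) for a in (-4, -2, 2, 4)] \
           + [("L", a) for a in (-4, -2, 2, 4)]
```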
4) Noise-Excited Channel Vocoder:
A four-band noise-
excited channel vocoder was employed to simulate the signal
processing of CIs [24], [34]. The loudness-modified sound was
first pre-emphasized through a first-order Butterworth high-
pass filter with a cut-off frequency of 1200 Hz. Then, it was
split into frequency bands (eighth-order Butterworth band-
pass filters) with corner frequencies of 80.0, 424.0, 1250.1,
3234.1, and 7999.0 Hz. The envelope in each band was
extracted using an eighth-order Butterworth low-pass filter
with a cut-off frequency of 400 Hz after full-wave rectification.
The same filter bank was applied to a white Gaussian noise
to generate four band-limited white noises, each of which
was subsequently amplitude-modulated by the corresponding
envelope extracted from the sub-band speech. The modulated
noises were again processed with the corresponding band-pass
filter, and the sum of the modulated noises was used as the
speech stimulus for the listeners.
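The vocoder described above can be sketched as follows, using the stated corner frequencies and filter orders; the input signal is an illustrative stand-in for speech, and the filters are realized in second-order sections for numerical stability.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_vocoder(x, fs=16000, edges=(80.0, 424.0, 1250.1, 3234.1, 7999.0)):
    """Four-band noise-excited channel vocoder: 1200 Hz first-order
    high-pass pre-emphasis, eighth-order Butterworth band-pass analysis,
    envelope extraction (full-wave rectification + 400 Hz eighth-order
    low-pass), noise modulation, and band-pass re-filtering before
    summation."""
    nyq = fs / 2.0
    x = sosfilt(butter(1, 1200 / nyq, btype="high", output="sos"), x)
    sos_lp = butter(8, 400 / nyq, output="sos")        # envelope low-pass
    noise = np.random.default_rng(0).standard_normal(len(x))
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        # an order-4 band-pass design yields an eighth-order filter
        sos_bp = butter(4, [lo / nyq, hi / nyq], btype="band", output="sos")
        env = sosfilt(sos_lp, np.abs(sosfilt(sos_bp, x)))
        modulated = env * sosfilt(sos_bp, noise)       # band-limited carrier
        out += sosfilt(sos_bp, modulated)              # re-filter and sum
    return out

fs = 16000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
y = noise_vocoder(speech_like, fs)
```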
5) Psychophysical Procedure:
Three test sessions, two with
vocoder simulation and one with non-vocoder processing,
were administered. Each of the vocoder-simulated sessions
contained two training tests (each lasting 8–10 min) with condition A, α1–84 = 0, and a formal test with all 13 of the conditions mentioned above. For each session of each subject, the 13 conditions were conducted in a random order.
The arithmetic means of the two vocoder sessions were
recorded as final results. After the two vocoder sessions,
a non-vocoder test session was conducted as a control, in
which the loudness-modified sounds (with three conditions, i.e., A with α1–84 = −4, 0, and 4), without vocoder processing, were used as the stimuli for the listeners.
With each condition of each session, the order of the test tasks
was also randomized. There were 64, 24, and 24 one-interval
four-alternative forced-choice trials for the T, C, and V tasks,
respectively. For the T task, subjects pressed the 1–4 number keys on a keyboard or clicked the 1–4 number buttons on a screen to select the identified tone. For the C and V tasks, subjects selected (by clicking on the screen) the target word among the four words of the corresponding word group listed in Section II-B2. All stimuli were presented at a comfortable
level (approximately 70 dBA) via a Roland Quad-Capture
UA-55 audio interface and a Sennheiser HD 650 headset in
an anechoic chamber. No correctness feedback was provided
to the subjects during the experiments.
C. Perceptual Testing 2: Cochlear Implant Users
1) Subjects:
Four unilateral CI users (C1, C2, C4, and C6)
and two bilateral CI users (C12 and C13) participated in
this experiment. C1 and C12 are female and the others are
male. All of them are native Mandarin speakers, but most of them (except C2, who is from Henan Province) grew up in families from Guangdong Province, where Mandarin is generally spoken less natively than in northern China. All unilateral CI subjects were tested with clean
speech (denoted by C1Q, C2Q, C4Q, and C6Q). Their details
TABLE I
CHARACTERISTICS OF COCHLEAR IMPLANT USERS TESTED IN QUIET
TABLE II
CHARACTERISTICS OF COCHLEAR IMPLANT USERS TESTED IN NOISE
are shown in Table I. C2, C6, C12, and C13 were tested with white-noise-corrupted speech (denoted by C2N, C6N, C12N, and C13N). Their details are shown in Table II. To avoid floor effects in tone recognition, the signal-to-noise ratios (SNRs) for C2N, C6N, C12N, and C13N were set to −5, 10, 5, and −10 dB, respectively, such that moderate tone recognition accuracies (approximately 70% in a pilot session) could be assured. To minimize the effect of incorrect F0 estimates, F0 contours were extracted from the clean speech. All CI users provided
written informed consent before the experiment.
2) Materials:
The same tasks (T, C, and V) and materials as
in the vocoder simulation experiments were used.
3) Test Conditions (Algorithm Parameters):
L-Tone was
employed with only one frequency band condition, i.e., con-
dition A as in Section II-B. As will be demonstrated by the results in the next section, condition A generally showed better tone recognition performance than both H and L in the vocoder simulation experiment. Therefore, band conditions H and L were not included in the CI tests to save the patients' time. All monosyllabic words in tasks T, C, and V were
processed using L-Tone to generate the loudness-modified
sounds, which were directly used as the stimuli for the CI
subjects.
4) Implant Signal Processing:
All subjects' CI devices were programmed with their daily-used strategies. Specifically, the strategies were the advanced combination encoder (ACE) for the Cochlear devices [35], the advanced peak selection (APS) strategy for the Nurotron device [36], and the FS4-p strategy for the MED-EL devices [37]. ACE and APS are n-of-m strategies, which extract temporal envelopes from the outputs of m bandpass filters and sequentially select the n channels having the largest energy to stimulate the nerves. FS4-p is one of the MED-EL fine structure processing strategies, which were proposed to preserve some temporal fine structure in the lowest 2 to 4 channels.
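The maxima-selection step of an n-of-m strategy can be sketched for a single stimulation frame; the envelope values below are hypothetical.

```python
import numpy as np

def select_n_of_m(envelopes, n):
    """n-of-m maxima selection: given one frame of m channel envelope
    values, return the indices of the n largest-energy channels (the
    only channels stimulated in that frame)."""
    envelopes = np.asarray(envelopes)
    return np.sort(np.argsort(envelopes)[-n:])

frame = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.4]  # m = 8 channels
picked = select_n_of_m(frame, 4)                    # 4-of-8 selection
```

Here the four largest envelope values sit in channels 1, 3, 5, and 7, so only those channels would be stimulated in this frame.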
5) Psychophysical Procedure:
For C1Q, C2Q, C4Q, and C2N, two test sessions were administered and the average of the two sessions' results was calculated as the final result, whereas, for the sake of time, only one test session was administered for C6Q, C6N, C12N, and C13N. Each test session contained five test conditions (i.e., A with α1–84 = −4, −2, 0, 2, and 4) in a random order. Before each test session, there were two training sessions with condition A, α1–84 = 0 (each lasting 8–10 min). Under each condition of all sessions, the order of T, C, and V was also randomized, and the psychophysical procedure was the same as in the vocoder simulation experiment. All stimuli were presented at a comfortable level (approximately 70 dBA) via a Roland Quad-Capture UA-55 audio interface and a YAMAHA HS8 loudspeaker placed approximately one meter in front of the subject's head in an anechoic chamber.
III. RESULTS
A. Results of Normal-Hearing Listeners
For the non-vocoded session, all subjects showed perfect
results (100% accuracy) on all tasks, indicating no effect
of loudness manipulation on non-vocoded speech for normal
hearing.
Vocoder simulation results for tone recognition are shown
in Fig. 6. Recognition accuracies, including the means and
standard deviations over the subjects, for all 13 test condi-
tions are plotted. A two-way repeated-measures analysis of
variance (RM-ANOVA) was used to examine the effects of
the manipulation factors (i.e., α=4,2,0,2,and4)andthe
band conditions (i.e., A, H, and L). The analyses showed that
1) the main effect of αwas significant [F(4,24) =59.47,
p<0.001]; 2) the main effect of the band condition was
not significant [F(2,12) =3.35, p=0.07]; and 3) the
interaction effect between αand the band condition was
significant [F(8,48) =21.80, p<0.001].
A one-way RM-ANOVA was used to analyze the effect of α under each band condition. The mean tone recognition scores increased significantly as α increased for all band conditions [A: F(4,24) = 85.13, p < 0.001; H: F(4,24) = 40.67, p < 0.001; L: F(4,24) = 11.61, p < 0.001].
Fig. 6. Means of percent-correct scores of tone recognition (i.e., task T) with vocoder simulation under the three band conditions A, H, and L. The error bars show one standard deviation. The significance of the pairwise difference between each L-Tone-modified condition (i.e., α = −4, −2, 2, and 4) and the unmodified condition (i.e., α = 0) is also illustrated by the asterisk symbols: * 0.01 < p < 0.05, ** 0.005 < p < 0.01, *** p < 0.005.

The significance of the pairwise difference between each L-Tone-modified condition (i.e., α = −4, −2, 2, and 4) and the unmodified condition (i.e., α = 0) is also illustrated by the asterisks in Fig. 6. We can see that 1) for band condition A, both positive α values showed significant benefits and both negative α values showed significant negative effects; 2) for band condition H, α = 4 showed significant benefits, α = 2 showed an insignificant effect, and both negative α values showed significant negative effects; and 3) for band condition L, α = −4 showed a significant negative effect, α = 4 showed a significant positive effect, and α = −2 and 2 showed insignificant effects.
A one-way RM-ANOVA was used to further analyze the effect of the band condition (i.e., A, H, and L) under each modified condition (i.e., α = −4, −2, 2, and 4). Under the conditions of α = −4 and −2, the mean results showed a consistent and significant (p < 0.05) trend of A < H < L. Under the conditions of α = 4 and 2, the mean results with A were always significantly (p < 0.05) greater than those with H and L, and the mean results with H and those with L were comparable (p > 0.05).
The mean consonant recognition (task C) and vowel recognition (task V) accuracies are illustrated in Fig. 7. No significant effect of α was found either in task C [F(4,24) = 1.76, p = 0.171] or in task V [F(4,24) = 1.98, p = 0.131], although large individual variations were found.
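The one-way RM-ANOVA F statistics reported above can be reproduced with a small numpy routine; this is a sketch of the standard sums-of-squares decomposition, as the paper does not state which statistics package was used.

```python
import numpy as np

def rm_anova_oneway(scores):
    """F statistic of a one-way repeated-measures ANOVA.
    scores: (n_subjects, k_conditions) array, one score per cell."""
    s = np.asarray(scores, dtype=float)
    n, k = s.shape
    grand = s.mean()
    ss_total = ((s - grand) ** 2).sum()
    ss_subj = k * ((s.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cond = n * ((s.mean(axis=0) - grand) ** 2).sum()  # between conditions
    ss_err = ss_total - ss_subj - ss_cond                # residual
    df_cond, df_err = k - 1, (n - 1) * (k - 1)
    f_stat = (ss_cond / df_cond) / (ss_err / df_err)
    return f_stat, (df_cond, df_err)
```

With seven subjects and five α levels, the degrees of freedom come out as (4, 24), matching the F(4, 24) statistics reported above.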
B. Results of Cochlear Implant Users
The experimental results for CI users are shown in Fig. 8. The accuracies of tone recognition in quiet and in noise, including their means and standard deviations over the subjects, are plotted for different α in panels (a) and (b) of Fig. 8. One-way RM-ANOVA showed that the main effect of α was significant for both CI experiments in quiet [F(4,12) = 25.00, p < 0.001] and in noise [F(4,12) = 12.40, p < 0.001], suggesting that the tone recognition accuracy tended to increase as α increased. Pairwise comparisons showed that, compared with the unmodified condition (i.e., α = 0), positive α
Fig. 7. Means of percent-correct scores of initial consonant recognition (i.e., task C; left) and final vowel recognition (i.e., task V; right) with vocoder simulation. The error bars show one standard deviation.
Fig. 8. Means of percent-correct scores of tone recognition (i.e., task T), initial consonant recognition (i.e., task C; left), and final vowel recognition (i.e., task V; right) with CIs. The error bars show one standard deviation. a) Mean T scores of four CI users in quiet; b) mean C and V scores of four CI users in quiet; c) mean T scores of four CI users in noise; d) mean C and V scores of four CI users in noise. The significance of the pairwise difference between each L-Tone-modified condition (i.e., α = −4, −2, 2, and 4) and the unmodified condition (i.e., α = 0) is also illustrated by the asterisk symbols: * 0.01 < p < 0.05, ** 0.005 < p < 0.01, *** p < 0.005.
usually produced significantly higher recognition accuracy and negative α usually produced significantly lower recognition accuracy (p < 0.05). The only two exceptions were the pair α = 2 and 0 (p = 0.55) in quiet and the pair α = 0 and −2 (p = 0.32) in noise, within neither of which a significant difference was found. The first exception suggests an advantage of a strong manipulation (e.g., α = 4) over a weak manipulation (e.g., α = 2) for CI tone recognition in quiet.
The mean accuracies for consonant recognition (task C) and vowel recognition (task V) by CI users in quiet and in noise are illustrated in panels (c) and (d) of Fig. 8. No significant effect of α was found in task C [F(4,12) = 0.54, p = 0.72] or task V [F(4,12) = 0.23, p = 0.91] with CI users in quiet. No significant effect of α was found in task C with
Fig. 9. Tone recognition confusion matrices for vocoder simulation (top), CIs in quiet (middle), and CIs in noise (bottom) for α = −4 (left), 0 (middle), and 4 (right) with band condition A. Unit: %.
CI users in noise [F(4,12) = 0.53, p = 0.72]. However, a significant effect of α was found in task V with CI users in noise [F(4,12) = 6.86, p = 0.004]. Pairwise comparisons showed that α = 2 yielded significantly (p = 0.04) higher vowel recognition accuracy than α = 0, which could be explained by a learning effect introduced by an overlooked error in the condition randomization: α = 2 was mostly tested after α = 0 in the CI test in noise. In addition, α = 4 yielded near-significantly (p = 0.09) lower vowel recognition accuracy than α = 0, which provides preliminary evidence for the advantage of a weak manipulation (e.g., α = 2) over a strong manipulation (e.g., α = 4) for CI vowel recognition in noise.
C. Conflicting and Co-varying Effects of Loudness and
Pitch on Tone Recognition
To further illustrate the interaction of loudness contour and pitch contour in Mandarin tone recognition in the experiments, confusion matrices for the tone recognition results with α = −4, 0, and 4 and with all-band modification are given in Fig. 9. The condition α = −4 was designed to observe the conflicting effects of loudness and pitch, whereas the condition α = 4 was designed to observe their co-varying effects. The three rows from top to bottom represent vocoder simulations, CIs in quiet, and CIs in noise, respectively. We can see that 1) the co-varying condition concentrated the results on the diagonals, that is, more correct tone identifications were achieved; 2) the conflicting condition showed opposite effects, e.g., more Tone 4 (the falling tone) and Tone 3 (the falling-rising tone) were identified as Tone 2 (the rising tone), more Tone 2 was identified as Tone 4, and more non-flat tones (Tones 2, 3, and 4) were identified as the flat tone (i.e., Tone 1). These results provide some evidence of the loudness contour's contribution to Mandarin tone
recognition. The co-varying results indicate the effectiveness of L-Tone for Mandarin tone enhancement in CIs.
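Percent-correct confusion matrices of the kind shown in Fig. 9 can be tabulated directly from the trial-by-trial responses. The following is a minimal illustrative sketch, not the authors' analysis code; the function name and the row-wise percent normalization are our assumptions, and every tone is assumed to be presented at least once.

```python
import numpy as np

def tone_confusion_matrix(true_tones, resp_tones, n_tones=4):
    """Tabulate a row-normalized (percent) tone confusion matrix.

    Tones are numbered 1..n_tones; rows are the presented tones and
    columns are the responses, as in Fig. 9. Assumes each tone is
    presented at least once (otherwise a row sum would be zero).
    """
    counts = np.zeros((n_tones, n_tones))
    for t, r in zip(true_tones, resp_tones):
        counts[t - 1, r - 1] += 1
    # each row sums to 100%, matching the "Unit: %" convention
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```

With such a tabulation, a co-varying condition shows mass concentrated on the diagonal, while a conflicting condition shifts mass into off-diagonal cells such as (Tone 4 → Tone 2).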
IV. CONCLUSIONS AND DISCUSSIONS
Amplitude contour (or temporal envelope) was suggested as a secondary cue for lexical tone recognition in previous studies [9], [19]–[21], [24]. In this paper, we argue that the effect of amplitude contour can be explained by the contribution of loudness contour. In other words, variation in loudness perception may induce a perception of lexical tone. The loudness contribution to tone identification was evaluated by both vocoder simulation and actual CI experiments using the proposed L-Tone algorithm. In the vocoder experiment, three band conditions, i.e., all bands modified, only high-frequency bands (>1250 Hz) modified, and only low-frequency bands (<1250 Hz) modified, were applied to examine the band effect on the effectiveness of L-Tone. The results suggest that the high-frequency components make a relatively greater contribution to the effectiveness of L-Tone than the low-frequency components. Nevertheless, simultaneously manipulating all frequency bands resulted in better performance than manipulating only the high-frequency bands. Therefore, full-band manipulation is suggested to fully exploit the benefits of L-Tone.
Using L-Tone with positive α, i.e., the co-varying condition of loudness and pitch, the perceptual tests with both simulated and actual CI-hearing listeners showed significant improvement in tone recognition. For CIs in quiet, significant gains in tone recognition accuracy were achieved by the strong manipulation (i.e., α = 4), but not by the weak manipulation (i.e., α = 2). For CIs in noise, significant gains in tone recognition accuracy were achieved by both the weak and the strong manipulations. However, the strong manipulation is not recommended due to its negative effect on vowel recognition, as shown in Sec. III-B. These results indicate that the α value should be adjusted according to different SNR conditions when applying L-Tone in CIs.
Although promising results have been obtained, there are some limitations in this study. First, the number of CI subjects was relatively small and individual variation was large. Second, the effects of prelingual versus postlingual deafness, years of CI experience, and age could not be considered due to the insufficient number of subjects, and more work needs to be done to optimize L-Tone for different types of CI users. Third, the intelligibility tasks (i.e., consonant and vowel recognition) using four-alternative forced-choice trials were easy for the CI users (especially in quiet), so the possible side effects of L-Tone on speech intelligibility could not be fully revealed.
In both the vocoder simulation and the actual CI experiments, the subjects sometimes identified the tones according to the loudness contour rather than the original pitch contour, especially in the conflicting condition (e.g., α = −4), which is well illustrated in Fig. 9. This suggests that, with both vocoder and CI stimuli, where the primary pitch cues are inadequately transmitted, the secondary contribution of the loudness cue to tone recognition can be revealed. In the future, L-Tone can be used to study the interaction between pitch and loudness cues in lexical tone recognition and in other voice pitch perception tasks (e.g., intonation recognition). Although our experiments were carried out with Mandarin materials, L-Tone is also suggested for other tonal languages. It should be noted that Mandarin tone contrasts differ mainly in the F0 contour, whereas several languages (e.g., Cantonese) use both F0 contour and register. Whether loudness contrasts between Cantonese words can be identified as tone contrasts, and if so, how to define the gain-F0 function, requires further study.
As for real-time implementation, L-Tone can be directly incorporated into CI strategies. The Gammatone filterbank can be replaced by the default filterbank of the CI strategy. The "min" operation in (1), which is a non-causal operation, should be replaced by a constant value, e.g., 50 Hz. Additionally, the loudness manipulation can be performed in the pre-processing stage or after the bandpass filtering. L-Tone could also be used for rehabilitation training. For example, it can be incorporated into computer-aided speech training software to manipulate the sound materials. The loudness manipulation factor α can be gradually adjusted during training from a high positive value to zero, or to a negative value if necessary. L-Tone with positive α can exaggerate the contrast between different tones and consequently make the contour cues easier to identify. L-Tone with negative α degrades the original loudness contour, so that pitch-contour discrimination ability can be examined without the influence of loudness contour. This method may be useful for tone recognition training, especially in the first several months of speech training after implant activation.
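A causal L-Tone gain stage of the kind described above can be sketched as follows. The exact gain-F0 mapping of Eq. (1) is not reproduced in this excerpt, so the power-law form, the 50-Hz constant reference (the causal replacement for the "min" operation suggested above), and all names below are illustrative assumptions rather than the authors' implementation; unvoiced frames are assumed to pass through unmodified.

```python
import numpy as np

F0_REF_HZ = 50.0  # constant reference replacing the non-causal "min"

def ltone_gain(f0_hz, alpha):
    """Map an instantaneous F0 (Hz) to an amplitude gain (assumed form).

    alpha > 0 makes loudness co-vary with pitch, alpha < 0 makes it
    conflict, and alpha = 0 leaves the signal unchanged. Unvoiced
    frames (f0_hz <= 0) receive unity gain.
    """
    if f0_hz <= 0:
        return 1.0
    # assumed power-law gain-F0 mapping: one octave above the
    # reference scales the amplitude by 2 ** (alpha / 12)
    return (f0_hz / F0_REF_HZ) ** (alpha / 12.0)

def apply_ltone(frames, f0_track, alpha):
    """Scale each short frame of samples by its frame's gain."""
    return [ltone_gain(f0, alpha) * np.asarray(x, dtype=float)
            for x, f0 in zip(frames, f0_track)]
```

In a training schedule, one would call `apply_ltone` with a gradually decreasing α across sessions, e.g., from 4 down to 0 (or to a negative value when probing pitch-only discrimination).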
ACKNOWLEDGMENT
The authors would like to thank all subjects who partici-
pated in this study. They are grateful to X. Zhang, F. Qiu,
T. Heng, H. Zeng, and L. Wang from Shenzhen Univ., Y. Cai
from Sun Yat-sen Univ., G. Yu and D. Rao from SCUT, L. Yin,
L. Ping, and G. Tang from Nurotron Company, H. Mou from
Chinese Academy of Sciences, and T. Guan and S. Liu from
Tsinghua Univ. for their help during the experiments.
REFERENCES
[1] A. Li, N. Wang, J. Li, J. Zhang, and Z. Liu, “Mandarin lexical tones
identification among children with cochlear implants or hearing aids,”
Int. J. Pediatric Otorhinolaryngol., vol. 78, no. 11, pp. 1945–1952, 2014.
[2] C.-M. Wu, T.-C. Liu, N.-M. Wang, and W.-C. Chao, "Speech perception and communication ability over the telephone by Mandarin-speaking children with cochlear implants," Int. J. Pediatric Otorhinolaryngol., vol. 77, no. 8, pp. 1295–1302, 2013.
[3] B. S. Wilson et al., “Better speech recognition with cochlear implants,”
Nature, vol. 352, pp. 236–238, Jul. 1991.
[4] P. Loizou, “Speech processing in vocoder-centric cochlear implants,”
Adv. Otorhinolaryngol., vol. 64, pp. 109–143, 2006.
[5] D. Han et al., "Lexical tone perception with HiResolution and HiResolution 120 sound-processing strategies in pediatric Mandarin-speaking cochlear implant users," Ear Hear., vol. 30, no. 2, pp. 169–177, 2009.
[6] W. Wang, N. Zhou, and L. Xu, “Musical pitch and lexical tone percep-
tion with cochlear implants,” Int. J. Audiol., vol. 50, no. 4, pp. 270–278,
2011.
[7] V. Ciocca, A. L. Francis, R. Aisha, and L. Wong, “The perception
of Cantonese lexical tones by early-deafened cochlear implantees,”
J. Acoust. Soc. Am., vol. 111, no. 5, pp. 2250–2256, 2002.
[8] C.-G. Wei, K. Cao, and F.-G. Zeng, “Mandarin tone recognition in
cochlear-implant subjects,” Hear. Res., vol. 197, nos. 1–2, pp. 87–95,
2004.
[9] Q.-J. Fu, F.-G. Zeng, R. V. Shannon, and S. D. Soli, “Importance of
tonal envelope cues in Chinese speech recognition,” J. Acoust. Soc. Am.,
vol. 104, no. 1, pp. 505–510, 1998.
[10] L. Geurts and J. Wouters, “Coding of the fundamental frequency
in continuous interleaved sampling processors for cochlear implants,”
J. Acoust. Soc. Am., vol. 109, no. 2, pp. 713–726, 2001.
[11] T. Green, A. Faulkner, S. Rosen, and O. Macherey, “Enhancement of
temporal periodicity cues in cochlear implants: Effects on prosodic
perception and vowel identification,” J. Acoust. Soc. Am., vol. 118, no. 1,
pp. 375–385, 2005.
[12] T. Green, A. Faulkner, and S. Rosen, “Enhancing temporal cues to voice
pitch in continuous interleaved sampling cochlear implants,” J. Acoust.
Soc. Am., vol. 116, no. 4, pp. 2298–2310, 2004.
[13] M. Milczynski, J. E. Chang, J. Wouters, and A. van Wieringen,
“Perception of Mandarin Chinese with cochlear implants using enhanced
temporal pitch cues,” Hear. Res., vol. 285, nos. 1–2, pp. 1–12, 2012.
[14] T. Lee, S. Yu, M. Yuan, T. K. C. Wong, and Y.-Y. Kong, “The effect of
enhancing temporal periodicity cues on Cantonese tone recognition by
cochlear implantees,” Int. J. Audiol., vol. 53, no. 8, pp. 546–557, 2014.
[15] A. E. Vandali and R. J. M. van Hoesel, “Development of a temporal
fundamental frequency coding strategy for cochlear implants,” J. Acoust.
Soc. Am., vol. 129, no. 6, pp. 4023–4036, 2011.
[16] X. Li et al., “Improved perception of speech in noise and Mandarin tones
with acoustic simulations of harmonic coding for cochlear implants,”
J. Acoust. Soc. Am., vol. 132, no. 5, pp. 3387–3398, 2012.
[17] X. Li, K. Nie, N. S. Imennov, J. T. Rubinstein, and L. E. Atlas,
“Improved perception of music with a harmonic based algorithm for
cochlear implants,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 21,
no. 4, pp. 684–694, Jul. 2013.
[18] Q. Meng, N. Zheng, and X. Li, “Mandarin speech-in-noise and tone
recognition using vocoder simulations of the temporal limits encoder for
cochlear implants,” J. Acoust. Soc. Am., vol. 139, no. 1, pp. 301–310,
2016.
[19] D. H. Whalen and Y. Xu, “Information for Mandarin tones in the
amplitude contour and in brief segments,” Phonetica, vol. 49, no. 1,
pp. 25–47, 1992.
[20] Q.-J. Fu and F.-G. Zeng, "Identification of temporal envelope cues in Chinese tone recognition," Asia Pacific J. Speech, Lang. Hear., vol. 5, no. 1, pp. 45–57, 2000.
[21] Y.-Y. Kong and F.-G. Zeng, “Temporal and spectral cues in Mandarin
tone recognition,” J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2830–2840,
2006.
[22] L. Xu, Y. Tsai, and B. E. Pfingst, “Features of stimulation affecting tonal-
speech perception: Implications for cochlear prostheses,” J. Acoust. Soc.
Am., vol. 112, no. 1, pp. 247–258, 2002.
[23] L. Xu and B. E. Pfingst, “Relative importance of temporal envelope
and fine structure in lexical-tone perception (L),” J. Acoust. Soc. Am.,
vol. 114, no. 6, pp. 3024–3027, 2003.
[24] X. Luo and Q.-J. Fu, “Enhancing Chinese tone recognition by manipulat-
ing amplitude envelope: Implications for cochlear implants,” J. Acoust.
Soc. Am., vol. 116, no. 6, pp. 3659–3667, 2004.
[25] S. Duanmu, The Phonology of Standard Chinese. London, U.K.: Oxford
Univ. Press, 2002, pp. 225–253.
[26] J. Plantinga and L. J. Trainor, “Memory for melody: Infants use a relative
pitch code,” Cognition, vol. 98, no. 1, pp. 1–11, 2005.
[27] J. H. McDermott, A. J. Lehr, and A. J. Oxenham, “Is relative pitch
specific to pitch?” Psychol. Sci., vol. 19, no. 12, pp. 1263–1271, 2008.
[28] M. Cousineau, L. Demany, B. Meyer, and D. Pressnitzer, “What breaks a
melody: Perceiving F0 and intensity sequences with a cochlear implant,”
Hearing Res., vol. 269, nos. 1–2, pp. 34–41, 2010.
[29] X. Luo, M. E. Masterson, and C.-C. Wu, “Contour identification with
pitch and loudness cues using cochlear implants,” J. Acoust. Soc. Am.,
vol. 135, no. 1, pp. EL8–EL14, 2014.
[30] L. Cabrera, F.-M. Tsao, D. Gnansia, J. Bertoncini, and C. Lorenzi, “The
role of spectro-temporal fine structure cues in lexical-tone discrimination
for French and Mandarin listeners,” J. Acoust. Soc. Am., vol. 136, no. 2,
pp. 877–882, 2014.
[31] L. Cabrera et al., “The perception of speech modulation cues in lexical
tones is guided by early language-specific experience,” Front. Psychol.,
vol. 6, p. 1290, 2015.
[32] M. Brookes. (2014). Voicebox: Speech Processing Toolbox for
MATLAB. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/
voicebox/voicebox.html
[33] P. Boersma and D. Weenink. (2014). Praat: Doing Phonetics by Computer (Version 5.3.79). [Online]. Available: http://www.praat.org/
[34] R. V. Shannon, F.-G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid,
“Speech recognition with primarily temporal cues,” Science, vol. 270,
no. 5234, pp. 303–304, 1995.
[35] A. E. Vandali, L. A. Whitford, K. L. Plant, and G. M. Clark, “Speech
perception as a function of electrical stimulation rate: Using the nucleus
24 cochlear implant system,” Ear Hear., vol. 21, no. 6, pp. 608–624,
2000.
[36] F. G. Zeng et al., “Development and evaluation of the nurotron
26-electrode cochlear implant system,” Hear. Res., vol. 322,
pp. 188–199, Apr. 2015.
[37] D. Riss et al., “FS4, FS4-p, and FSP: A 4-month crossover study of
3 fine structure sound-coding strategies,” Ear Hear., vol. 35, no. 6,
pp. e272–e281, 2014.
Qinglin Meng received the B.S. degree in electronic information science from Harbin Engineering University, Harbin, China, in 2008, and the Ph.D. degree in signal processing from the Institute of Acoustics, Chinese Academy of Sciences, Beijing, China, in 2013.
From 2013 to 2016, he was a postdoctoral researcher at the College of Information Engineering, Shenzhen University, China. He is currently a lecturer at the School of Physics and Optoelectronics, South China University of Technology, China. His research focuses on cochlear implant technology, psychoacoustics, and physiological acoustics.
Nengheng Zheng (M’06) received the
B.S. degree in electronic engineering and
the M.S. degree in acoustics from Nanjing
University, Nanjing, China, in 1997 and 2002,
respectively, and the Ph.D. degree in electronic
engineering from the Chinese University of
Hong Kong, Hong Kong SAR of China, in 2006.
He is currently an Associate Professor at the
College of Information Engineering, Shenzhen
University, China. From 2014 to 2015, he was
a visiting scholar at the School of Electrical
Engineering and Telecommunications, University of New South Wales,
Australia. His research focuses on speech and audio signal processing
for human and machine perceptions.
Xia Li (M’08) was born in 1968. She received the
B.S. degree in electronics engineering and the
M.S. degree in signal and information processing
from Xidian University, Xi’an, China, and the
Ph.D. degree in information engineering from the
Chinese University of Hong Kong, in 1997.
She is currently a Professor with the College
of Information Engineering, Shenzhen University,
China. Her current research interests include
computational intelligence, image processing,
and pattern recognition.