IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 25, NO. 6, JUNE 2017 641
Loudness Contour Can Influence Mandarin Tone
Recognition: Vocoder Simulation and
Cochlear Implants
Qinglin Meng, Nengheng Zheng, Member, IEEE, and Xia Li, Member, IEEE
Abstract—Lexical tone recognition with current cochlear implants (CI) remains unsatisfactory due to significantly degraded pitch-related acoustic cues, which dominate tone recognition by normal-hearing (NH) listeners. Several secondary cues (e.g., amplitude contour, duration, and spectral envelope) that influence tone recognition in NH listeners and CI users have been studied. This work proposes a loudness contour manipulation algorithm, namely Loudness-Tone (L-Tone), to investigate the effects of loudness contour on Mandarin tone recognition and the effectiveness of using the loudness cue to enhance tone recognition for CI users. With L-Tone, the intensity of sound samples is multiplied by gain values determined by instantaneous fundamental frequencies (F0s) and pre-defined gain-F0 mapping functions. Perceptual experiments were conducted with a four-channel noise-band vocoder simulation in NH listeners and with CI users. The results suggested that 1) loudness contour is a useful secondary cue for Mandarin tone recognition, especially when pitch cues are significantly degraded; 2) L-Tone can be used to improve Mandarin tone recognition in both simulated and actual CI hearing without a significant negative effect on vowel and consonant recognition. L-Tone is a promising algorithm for incorporation into real-time CI processing and off-line CI rehabilitation training software.
Index Terms—Cochlear implant, loudness contour, Mandarin tone recognition, pitch.
I. INTRODUCTION
CONTEMPORARY multi-channel cochlear implant (CI)
systems can provide some lexical tone recognition capa-
bility for patients [1], [2], although most clinically available
CI strategies preserve only temporal envelopes [3] of the
Manuscript received December 29, 2015; revised June 1, 2016; accepted July 12, 2016. Date of publication July 20, 2016; date of current version June 18, 2017. This work was supported by the China Postdoctoral Science Foundation (2015M572360), Guangdong Natural Science Foundation (2014A030313557), Shenzhen Key Laboratory Project (CXB201105060068A), and a fund awarded by the China Scholarship Council (201308440223). Corresponding author: N. Zheng (e-mail: nhzheng@szu.edu.cn).
Q. Meng is with the Acoustic Lab, School of Physics and Optoelectronics, South China University of Technology (SCUT), Guangzhou 510641, China, and also with the College of Information Engineering, Shenzhen University, Shenzhen 518060, China.
N. Zheng is with the Shenzhen Key Laboratory of Modern Communication and Information Processing, College of Information Engineering, Shenzhen University, Shenzhen 518060, China. He was with the University of New South Wales, NSW 2052, Australia.
X. Li is with the Shenzhen Key Laboratory of Modern Communication and Information Processing, College of Information Engineering, Shenzhen University, Shenzhen 518060, China.
Digital Object Identifier 10.1109/TNSRE.2016.2593489
channel signals and are not customized for tonal languages [4].
However, the performance of lexical tone recognition with CIs
is significantly worse than that of normal hearing [5], [6],
moderately impaired hearing [7], and even vocoder-simulated
CI hearing [8].
The poor tone recognition performance with CIs has been
attributed primarily to the inadequate representation of pitch
cues in the electric stimulation. The frequency resolution of a
CI device is mostly determined by the number of electrodes
implanted (no more than 24 and far less than the number of
auditory filters in a normal cochlea), which results in poor
spectral pitch coding in CIs. However, it was found that the
temporal pitch coding may be partially preserved in the CI
processing and the corresponding coding cue is the periodicity
information in the temporal envelope of the electric stimuli
on each electrode [9]. Periodicity enhancement of the stimuli
and its potential benefits to improve voice pitch perception,
including lexical tone recognition, have been investigated.
For instance, periodicity can be enhanced by increasing the amplitude modulation depth [by subtracting 0.4 × the compressed slow envelope (<50 Hz) from the compressed fast envelope (<400 Hz)] [10], or by amplitude-modulating the temporal envelopes using saw-tooth or sinusoidal waveforms at the electrodes [11]–[15]. Besides, some recent studies proposed to
substitute frequency-downshift operation for classic temporal
envelope extraction in CI strategies [16]–[18]. They hypothe-
sized that this substitution could improve harmonic structure
or temporal fine structure representation which is useful for
pitch discrimination. Although these methods showed certain
potential improvements on CI pitch perception, the application
of such methods in commercial CI devices remains limited by
1) the capability of electric temporal pitch, 2) the engineering
difficulties in real-time and real-life realization [14], and
3) the risk of spectral information distortion caused by the
enhancement operations [11].
In addition to the primary pitch cues, secondary acoustic
cues such as amplitude contour (depending on its correlation
degree with the F0 contour), syllable duration (e.g., Tone 3 is
generally the longest one among the four tones of Mandarin),
and spectral envelope (e.g., for whispered Mandarin speech)
have also been shown to be useful for lexical tone recog-
nition [9], [19]–[21]. In these studies, the contributions of
the secondary cues were revealed using stimuli with sig-
nificantly degraded pitch cues (including fine structures in
both temporal and spectral domains [21]–[23]). These results
1534-4320 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
implied that some secondary cues might be useful for lexical
tone perception with CIs, in which pitch cues are usually
inadequately transmitted to patients. Luo and Fu [24] proposed
a method, known as Enhanced-Tone (E-Tone), to modify
the temporal amplitude contour to resemble the F0 contour,
and significant benefits for Mandarin tone recognition were
obtained with a four-channel vocoder CI simulation in normal-
hearing listeners.
How could amplitude-contour modification influence
lexical-tone recognition, which in linguists’ view is a pitch-
related or F0 -related task? In this study, we introduce
“loudness contour” as a middle layer to bridge the gap
between the subjective perception of tone and the objective
measurement of amplitude. As we know, the four tones in
Mandarin are characterized by F0 contour patterns (within
individual monosyllabic word) [25]; and musical melodies
are intrinsically dominated by the relative pitch which is
also coded by the F0 (between sequential notes of a musi-
cal instrument) [26]. Recent psychoacoustical studies have
revealed an interaction between pitch and loudness for melody
recognition. For example, McDermott et al. [27] found that our
auditory system has a common feature in the representations
of contours in pitch, loudness, and timbre; Cousineau et al.
and Luo et al. [28], [29] found an interaction between pitch
and loudness contour among CI users. In [29], normal-hearing
listeners showed better performance on pitch discrimination
than on loudness discrimination, whereas CI users showed
comparable performance on both tasks, and co-varying both
cues was suggested to be useful for melody recognition with
CIs. Similar effects of loudness on lexical tone discrimination
were also mentioned in some studies on simulations of the
modulation excitation patterns for lexical tones vocoded with
an eight-band tone excited vocoder [30], [31]. They showed
that rising and flat lexical tones differ in terms of slow and
fast amplitude modulation cues that are related to loudness
increase at the end of rising lexical tones.
The goal of this study is to further investigate the effects of loudness contour on Mandarin tone recognition and the potential of using the loudness cue to enhance tone recognition for CI users. A loudness manipulation algorithm, namely Loudness-Tone (L-Tone), is proposed to manipulate the instantaneous loudness of sound using instantaneous pitch values. Instead of manipulating the temporal envelope as done by E-Tone in [24], L-Tone directly manipulates the channel signals using predefined F0-gain functions, such that the intensity change (in dB) rather than the intensity itself is determined by the instantaneous F0. The relation between L-Tone and E-Tone is discussed in detail in Sec. II-A. We hypothesized that the loudness contour manipulated by L-Tone can drive tone discrimination, and that L-Tone might be able to improve the tone recognition ability of CI users without significant negative effects on speech intelligibility. To examine these hypotheses, Mandarin monosyllable recognition experiments were carried out in both vocoder-simulated normal-hearing listeners and actual CI listeners. There were three tasks
including tone recognition, initial consonant recognition, and final vowel recognition.

Fig. 1. Block diagram of the L-Tone algorithm.

The consonant and vowel recognition tasks were conducted to monitor the effect of L-Tone on speech intelligibility. All monosyllables were processed by
L-Tone. To observe the conflicting and co-varying effects
of loudness contour and pitch contour on tone recognition,
loudness contour of the stimuli was modified in both forward
and reverse directions along with the F0 contour. To observe
the band effects on the effectiveness of L-Tone, the stimuli
were processed by L-Tone with three band conditions, i.e., all
bands being modified, only high-frequency bands (>1250 Hz)
being modified, and only low-frequency bands (<1250 Hz)
being modified. The normal-hearing simulation group (seven
subjects in quiet) was tested with all three band conditions,
whereas the CI group (four subjects in quiet and four subjects
in noise) was tested only with the “all band condition” to
save time.
II. METHODS
A. Loudness Manipulation Algorithm (L-Tone)
With L-Tone, the intensity of each sample in the voiced
portions is multiplied by a gain value determined by the
instantaneous F0. A mapping function is used to generate a
loudness gain contour in a forward or reverse direction along
with the F0 contour. The unvoiced frames remain unchanged.
Fig. 1 illustrates the signal processing procedure of L-Tone. The original signal x(t) (digitized at a sampling rate of 16 kHz in this study) is passed through a set of Gammatone filters. In this study, the filter spacing and the bandwidth of each filter were respectively set to 0.35 and 1.019 equivalent rectangular bandwidths (ERB) [32]. Given the frequency range of 80–7999 Hz for the target speech, there were 84 filters in total. For the voiced portion of x(t), F0 values are first computed frame-by-frame with a frame length of 16 ms and no inter-frame overlap, and then an F0 contour F0(t) is generated by interpolation. In this study, a cubic spline interpolation was applied, and the F0 values were computed using the auto-correlation (ac) and cross-correlation (cc) methods in the Praat software [33] and then manually screened from the Praat results. An intensity gain function for the kth Gammatone filter output, Gk(t), is derived from the interpolated F0 contour by

Gk(t) = [(F0(t) − min(F0(t)))/100 + 1]^αk    (1)

where min(F0(t)) denotes the minimum F0 of the overall signal, and αk denotes the loudness manipulation factor for the kth channel. The output of the
kth Gammatone filter is multiplied by Gk(t). That is, its sound level is increased, or decreased with negative αk, by

20 log10(Gk(t)) = 20 αk log10[(F0(t) − min(F0(t)))/100 + 1] dB.    (2)

Fig. 2. The mapping function between the F0 difference and the sound level gain.
Finally, all level-modified outputs from the Gammatone filters are individually root-mean-square (RMS) equalized to retain the same RMS as the original outputs from the Gammatone filters and then combined to resynthesize a loudness-manipulated audio signal.
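The per-channel operation described above (the Eq. (1) gain applied to the voiced portion, followed by per-channel RMS equalization) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name and the NaN-for-unvoiced convention are our own assumptions, and the Gammatone filtering and F0 tracking are assumed to have been done already.

```python
import numpy as np

def l_tone_channel(band, f0, alpha):
    """Apply the L-Tone gain of Eq. (1) to one Gammatone channel output.

    band  : 1-D array, one Gammatone filter output
    f0    : per-sample interpolated F0 contour (Hz); NaN where unvoiced
    alpha : loudness manipulation factor alpha_k for this channel
    """
    gain = np.ones_like(band)            # unvoiced portions remain unchanged
    voiced = ~np.isnan(f0)
    f0_min = np.nanmin(f0)               # minimum F0 of the overall signal
    # Eq. (1): G_k(t) = ((F0(t) - min F0)/100 + 1) ** alpha_k
    gain[voiced] = ((f0[voiced] - f0_min) / 100.0 + 1.0) ** alpha
    out = band * gain
    # RMS-equalize so the channel retains its original RMS level
    rms_in = np.sqrt(np.mean(band ** 2))
    rms_out = np.sqrt(np.mean(out ** 2))
    return out * (rms_in / rms_out) if rms_out > 0 else out
```

Summing the 84 channel outputs processed this way would then resynthesize the loudness-manipulated signal.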
Equation (2) shows that, for a given monosyllabic signal and a predefined factor αk, the sound level gain (in dB) for the kth channel depends on the difference between the instantaneous F0 and the minimum F0 of the target word. Fig. 2 demonstrates this dependency for different αk (the subscript k is omitted in the figure). We can see that α = 0 represents no adjustment of the sound level, a positive α represents an increasing sound level gain as F0 increases, a negative α represents a decreasing sound level gain as F0 increases, and a larger |α| indicates a larger dynamic range of the sound level adjustment. For example, for a speech sample with an F0 of min(F0(t)) + 100 Hz, the sound level gains are approximately −24.0, −12.0, 0.0, 12.0, and 24.0 dB for αk = −4, −2, 0, 2, and 4, respectively.
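These example values follow directly from Eq. (2): at F0 = min(F0) + 100 Hz the gain is 20·α·log10(2) ≈ 6.02·α dB, i.e., roughly ±24 dB at α = ±4. A quick standard-library check (the helper name is ours):

```python
from math import log10

def level_gain_db(f0, f0_min, alpha):
    """Sound-level gain of Eq. (2), in dB."""
    return 20.0 * alpha * log10((f0 - f0_min) / 100.0 + 1.0)

# A sample whose F0 lies 100 Hz above min(F0(t)): gain = 20*alpha*log10(2)
gains = [round(level_gain_db(300.0, 200.0, a), 1) for a in (-4, -2, 0, 2, 4)]
print(gains)  # [-24.1, -12.0, 0.0, 12.0, 24.1]
```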
As an example, the loudness manipulation of the tenth filter output (with α10 = −4 and 4) is demonstrated in Fig. 3, in which the Hilbert envelopes of the original and the manipulated signals are given for comparison. The target monosyllable is /bá/ (a rising tone). As illustrated, the envelope of the signal modified with α10 = 4 shows a more rising trend, and that with α10 = −4 shows a falling trend, compared to that of the original envelope (i.e., the solid black line).
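The Hilbert envelopes compared in Fig. 3 are simply the magnitude of the analytic signal; a one-line sketch (assuming SciPy is available; the wrapper name is ours):

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Hilbert envelope: magnitude of the analytic signal of x."""
    return np.abs(hilbert(x))
```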
As another example, Fig. 4 demonstrates three electrodograms of a monosyllable /bǎ/ (a falling-rising tone) processed by L-Tone with α1–84 = −4, 0, and 4, respectively. As demonstrated, in comparison to the electrodogram with α1–84 = 0 (middle, no loudness manipulation), more energy is distributed to the left and right ends in the electrodogram with α1–84 = 4 (bottom, more falling and more rising), which may enhance the identification of Tone 3 in the sense of loudness contour rather than pitch contour. In contrast, more energy is distributed horizontally to the center with α1–84 = −4 (top, tending to be rising-falling or flat).

Fig. 3. A demonstration of loudness manipulation by L-Tone. The signal (middle grey line) is the tenth Gammatone filter (center frequency = 219.06 Hz) output of a Tone 2 token /bá/ (pull) spoken by a female.
The study of L-Tone was inspired by the E-Tone algorithm proposed by Luo and Fu [24]. In that study, the authors showed some advantages for tone recognition with a four-band noise vocoder in normal-hearing listeners by making the amplitude contour shape more similar to the F0 contour shape. In order to quantify the "similarity" between the two different physical quantities (i.e., pressure and frequency), the F0 contour was first calibrated to hold the same RMS as the temporal envelope (i.e., the amplitude contour, calculated by measuring the RMS of the input signal on a frame-by-frame basis), and then the temporal envelope was partially or fully substituted by the calibrated F0 contour through a linear weighting function (see [24, eq. (1)–(2)]).
Both L-Tone and E-Tone produce signals whose amplitude
contours are changed according to the F0 contour. Never-
theless, there are two major differences between the two
algorithms. First, with L-Tone, the input signal is multiplied by
a gain function determined by the instantaneous F0. This mul-
tiplication operation, compared to the linear weighting with
E-Tone, gives the gain of intensity (or loudness) in dB a clear
mapping relationship to the F0 (as demonstrated in Fig. 2).
With E-Tone, this mapping is ambiguous, as E-Tone pays
more attention to the shape-similarity between the amplitude
contour and the F0 contour than the mapping relationship
between the loudness variation and F0. Second, the attack and
release time information of the signal may be destroyed by
E-Tone, because the sudden change of F0 (between unvoiced
and voiced portions or between voiced and unvoiced portions)
may result in a zero attack or zero release time in the modified
stimuli. L-Tone can avoid this problem to some extent, because the intensities at the two ends of a sound rise from or fall to a small value or even zero, and their multiplication by the gain values will also yield small values.
B. Perceptual Testing 1: Vocoder Simulation in
Normal-Hearing Listeners
The following three tasks were tested: Mandarin tone identification (T), initial consonant recognition (C), and final vowel recognition (V). Tasks C and V were used to preliminarily evaluate whether L-Tone introduces any negative effects on speech intelligibility.

Fig. 4. Electrodogram demonstrations of L-Tone. The target speech is a female /bǎ/. a) α1–84 = −4, representing the reverse condition; b) α1–84 = 0, representing no modification; c) α1–84 = 4, representing the forward condition. We drew this figure using the Nucleus MATLAB Toolbox 4.20 software.
1) Subjects: Seven normal-hearing college students (S1–S6 and S8) participated in this experiment. They are native Mandarin speakers, but they grew up in the South of China, where Mandarin is generally spoken less proficiently than in the North of China. They provided written informed consent before the experiment and were paid hourly.
Fig. 5. F0 contour onset values and the F0 dynamic ranges of all 64 tokens used for the tone recognition task (F: female; M: male). The dynamic range was calculated as the largest F0 difference in the voiced part of a single spoken word. The "max" and "min" denote operations of computing the maximum and minimum, respectively.
2) Materials: Task T: Mandarin monosyllabic words were derived from the advanced tone test module of the AngelTest software (emilyshannonfufoundation.org). Speech data for these words were collected from four speakers (two males and two females), each producing four tones for the monosyllables (initial consonant: /b/; final vowels: /a/, /i/, /o/, and /u/), yielding a total of 64 tokens (4 speakers × 4 tones × 4 vowels). The onset values and dynamic ranges of the F0 contours for all tokens are illustrated in Fig. 5. We can see that Tone 4 (the falling tone) has the largest dynamic range (187–300 Hz for female; 127–189 Hz for male), and Tone 1 (the high-flat tone) has the smallest (mostly 8–43 Hz). The dynamic ranges of Tone 2 (the rising tone) and Tone 3 (the falling-rising tone) are within 47–125 Hz.
Tasks C and V: Mandarin monosyllabic words were derived from the basic consonant and vowel test modules (each with 24 stimuli) of AngelTest, including six consonant groups and six vowel groups. The consonant groups are as follows: 1. /pí/, /lí/, /qí/, /xí/; 2. /gǔ/, /hǔ/, /zhǔ/, /wǔ/; 3. /māo/, /dāo/, /chāo/, /yāo/; 4. /gǒu/, /kǒu/, /shǒu/, /zǒu/; 5. /jì/, /rì/, /cì/, /sì/; and 6. /fù/, /tù/, /nù/, /bù/. The six vowel groups are as follows: 1. /chá/, /chái/, /chán/, /chún/; 2. /shé/, /shí/, /sháo/, /shéng/; 3. /yá/, /yáng/, /yú/, /yíng/; 4. /mò/, /mù/, /mèi/, /miè/; 5. /guī/, /gōu/, /gēn/, /gōng/; and 6. /qiú/, /qué/, /qín/, /qún/.
3) Test Conditions (Algorithm Parameters): L-Tone was employed with the following three frequency band conditions: all 84 bands modified (A: 80–7999 Hz; α1–84 = −4, −2, 0, 2, and 4), only the high-frequency bands No. 43–84 modified (H: 1250–7999 Hz; α43–84 = −4, −2, 2, and 4; α1–42 = 0), and only the low-frequency bands No. 1–42 modified (L: 80–1250 Hz; α1–42 = −4, −2, 2, and 4; α43–84 = 0), yielding a total of 13 test conditions. For condition A, all monosyllabic words in tasks T, C, and V
were processed by L-Tone to generate the loudness-modified
sounds, whereas for conditions H and L, only monosyllabic
words in task T were processed in order to save time.
4) Noise-Excited Channel Vocoder: A four-band noise-excited channel vocoder was employed to simulate the signal processing of CIs [24], [34]. The loudness-modified sound was first pre-emphasized through a first-order Butterworth high-pass filter with a cut-off frequency of 1200 Hz. Then, it was split into frequency bands (eighth-order Butterworth band-pass filters) with corner frequencies of 80.0, 424.0, 1250.1, 3234.1, and 7999.0 Hz. The envelope in each band was extracted using an eighth-order Butterworth low-pass filter with a cut-off frequency of 400 Hz after full-wave rectification. The same filter bank was applied to a white Gaussian noise to generate four band-limited white noises, each of which was subsequently amplitude-modulated by the corresponding envelope extracted from the sub-band speech. The modulated noises were again processed with the corresponding band-pass filters, and the sum of the modulated noises was used as the speech stimulus for the listeners.
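The vocoder chain above can be sketched with SciPy. This is an illustrative sketch under the corner frequencies stated in the text, not the authors' exact implementation; in particular, reading "eighth-order band-pass" as a fourth-order Butterworth design (which yields an eighth-order band-pass) is our assumption.

```python
import numpy as np
from scipy.signal import butter, lfilter, sosfilt

FS = 16000  # sampling rate used in the study (Hz)
EDGES = [80.0, 424.0, 1250.1, 3234.1, 7999.0]  # band corner frequencies (Hz)

def vocode(x, seed=0):
    """Four-band noise-excited channel vocoder sketch (cf. Sec. II-B4)."""
    rng = np.random.default_rng(seed)
    # first-order Butterworth high-pass pre-emphasis at 1200 Hz
    b, a = butter(1, 1200.0, "high", fs=FS)
    x = lfilter(b, a, x)
    # 400 Hz eighth-order Butterworth low-pass for envelope extraction
    lp = butter(8, 400.0, "low", fs=FS, output="sos")
    out = np.zeros_like(x)
    for lo, hi in zip(EDGES[:-1], EDGES[1:]):
        # fourth-order design -> eighth-order band-pass filter
        bp = butter(4, [lo, hi], "band", fs=FS, output="sos")
        band = sosfilt(bp, x)
        env = sosfilt(lp, np.abs(band))                   # rectify + smooth
        noise = sosfilt(bp, rng.standard_normal(x.size))  # band-limited noise
        out += sosfilt(bp, noise * env)                   # modulate, re-filter, sum
    return out
```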
5) Psychophysical Procedure: Three test sessions, two with vocoder simulation and one with non-vocoder processing, were administered. Each vocoder-simulated session contained two training tests (each lasting 8–10 min) with condition A, α1–84 = 0, and a formal test with all 13 conditions mentioned above. For each session of each subject, the 13 conditions were conducted in a random order. The arithmetic means of the two vocoder sessions were recorded as the final results. After the two vocoder sessions, a non-vocoder test session was conducted as a control, in which the loudness-modified sounds (with three conditions, i.e., A with α1–84 = −4, 0, and 4), without vocoder processing, were used as the stimuli for the listeners. Within each condition of each session, the order of the test tasks was also randomized. There were 64, 24, and 24 one-interval four-alternative forced-choice trials for the T, C, and V tasks, respectively. For the T task, subjects pressed the 1–4 number keys on a keyboard or clicked the 1–4 number buttons on a screen to select the identified tone. For the C and V tasks, subjects selected (by clicking on the screen) the target word among the four words of the corresponding word group listed in Section II-B2. All stimuli were presented at a comfortable level (approximately 70 dBA) via a Roland Quad-Capture UA-55 audio interface and a Sennheiser HD 650 headset in an anechoic chamber. No correctness feedback was provided to the subjects during the experiments.
C. Perceptual Testing 2: Cochlear Implant Users
1) Subjects:
Four unilateral CI users (C1, C2, C4, and C6) and two bilateral CI users (C12 and C13) participated in this experiment. C1 and C12 are female and the others are male. All are native Mandarin speakers, but most of them (except C2, who is from Henan Province) grew up in families from Guangdong Province, where Mandarin is generally spoken less proficiently than in the North
of China. All unilateral CI subjects were tested with clean speech (denoted by C1Q, C2Q, C4Q, and C6Q); their details are shown in Table I. C2, C6, C12, and C13 were tested with white-noise-corrupted speech (denoted by C2N, C6N, C12N, and C13N); their details are shown in Table II. To avoid floor effects in tone recognition, the signal-to-noise ratios (SNRs) for C2N, C6N, C12N, and C13N were respectively set to 5, 10, 5, and 10 dB, such that moderate tone recognition accuracies (approximately 70% in a pilot session) could be assured. To minimize the effect of incorrect F0 estimation, F0 contours were extracted from the clean speech. All CI users provided written informed consent before the experiment.

TABLE I
CHARACTERISTICS OF COCHLEAR IMPLANT USERS TESTED IN QUIET

TABLE II
CHARACTERISTICS OF COCHLEAR IMPLANT USERS TESTED IN NOISE
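Mixing white noise into speech at a fixed SNR, as done for the CI-in-noise conditions, can be sketched as follows (our illustrative helper, not part of the study's code):

```python
import numpy as np

def add_white_noise(speech, snr_db, seed=0):
    """Mix white Gaussian noise into speech at a target SNR in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(speech.size)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db
    noise *= np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + noise
```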
2) Materials:
The same tasks (T, C, and V) and materials as
in the vocoder simulation experiments were used.
3) Test Conditions (Algorithm Parameters):
L-Tone was
employed with only one frequency band condition, i.e., condition A as in Section II-B. As will be demonstrated by the results in the next section, condition A generally showed better performance on tone recognition than both H and L in the vocoder simulation experiment. Therefore, band conditions H and L were not included in the CI tests to save the patients' time. All monosyllabic words in tasks T, C, and V were processed using L-Tone to generate the loudness-modified
sounds, which were directly used as the stimuli for the CI
subjects.
4) Implant Signal Processing: All subjects' CI devices ran their daily-used strategies. Specifically, the strategies were the advanced combination encoder (ACE) for the Cochlear devices [35], the advanced peak selection (APS) strategy for the Nurotron device [36], and the FS4-p strategy for the MED-EL devices [37]. ACE and APS are n-of-m strategies, which extract temporal envelopes from the outputs of m bandpass filters and sequentially select the n channels having the largest energy to stimulate the nerves. FS4-p is one of the MED-EL fine structure processing strategies, which were proposed to preserve some temporal fine structure in the lowest 2 to 4 channels.
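The n-of-m selection described above can be sketched as a frame-wise maximum-energy pick (an illustrative sketch; the array layout and function name are our own, not any vendor's implementation):

```python
import numpy as np

def n_of_m_select(envelopes, n):
    """Keep, per analysis frame, only the n of m channels with the
    largest envelope energy (as in ACE/APS n-of-m strategies).

    envelopes : (m, n_frames) array of channel envelopes
    returns   : same-shape array with non-selected channels zeroed
    """
    out = np.zeros_like(envelopes)
    for t in range(envelopes.shape[1]):
        top = np.argsort(envelopes[:, t])[-n:]   # indices of the n largest
        out[top, t] = envelopes[top, t]
    return out
```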
5) Psychophysical Procedure: For C1Q, C2Q, C4Q, and C2N, two test sessions were administered and the average of the results of the two sessions was calculated as the final result, whereas for the sake of time, only one test session was administered for C6Q, C6N, C12N, and C13N. Each test session contained five test conditions (i.e., A with α1–84 = −4, −2, 0, 2, and 4) in a random order. Before each test session, there were two training sessions with condition A, α1–84 = 0 (each lasting 8–10 min). Under each condition of all sessions, the order of T, C, and V was also randomized, and the psychophysical procedure was the same as in the vocoder simulation experiment. All stimuli were presented at a comfortable level (approximately 70 dBA) via a Roland Quad-Capture UA-55 audio interface and a YAMAHA HS8 loudspeaker placed approximately one meter in front of the subject's head in an anechoic chamber.
III. RESULTS
A. Results of Normal-Hearing Listeners
For the non-vocoded session, all subjects showed perfect
results (100% accuracy) on all tasks, indicating no effect
of loudness manipulation on non-vocoded speech for normal
hearing.
Vocoder simulation results for tone recognition are shown in Fig. 6. Recognition accuracies, including the means and standard deviations over subjects, for all 13 test conditions are plotted. A two-way repeated-measures analysis of variance (RM-ANOVA) was used to examine the effects of the manipulation factor (i.e., α = −4, −2, 0, 2, and 4) and the band condition (i.e., A, H, and L). The analyses showed that 1) the main effect of α was significant [F(4,24) = 59.47, p < 0.001]; 2) the main effect of the band condition was not significant [F(2,12) = 3.35, p = 0.07]; and 3) the interaction effect between α and the band condition was significant [F(8,48) = 21.80, p < 0.001].
A one-way RM-ANOVA was used to analyze the effects of α under each band condition. The mean tone recognition scores increased significantly as α increased for all band conditions [A: F(4,24) = 85.13, p < 0.001; H: F(4,24) = 40.67, p < 0.001; L: F(4,24) = 11.61, p < 0.001]. The significance of the pairwise differences between each L-Tone modified condition (i.e., α = −4, −2, 2, and 4) and the unmodified condition (i.e., α = 0) is also illustrated by the asterisks in Fig. 6. We can see that 1) for band condition A, both positive α values showed significant benefits and both negative α values showed significant negative effects; 2) for band condition H, α = 4 showed significant benefits, α = 2 showed an insignificant effect, and both negative α values showed significant negative effects; 3) for band condition L, α = −2 and 2 showed insignificant effects, and α = −4 and 4 showed significant negative and positive effects, respectively.

Fig. 6. Means of percent-correct scores of tone recognition (i.e., task T) with vocoder simulation under the three band conditions A, H, and L. The error bars show one standard deviation. The significance of the pairwise difference between each L-Tone modified condition (i.e., α = −4, −2, 2, and 4) and the unmodified condition (i.e., α = 0) is also illustrated by the asterisk symbols: * 0.01 < p < 0.05, ** 0.005 < p < 0.01, *** p < 0.005.
A one-way RM-ANOVA was used to further analyze the effects of the band condition (i.e., A, H, and L) under each modified condition (i.e., α = −4, −2, 2, and 4). Under the conditions of α = −4 and −2, the mean results showed a consistent and significant (p < 0.05) trend of A < H < L. Under the conditions of α = 4 and 2, the mean results with A were always significantly (p < 0.05) greater than those with H and L, and the mean results with H and those with L were comparable (p > 0.05).
The mean consonant recognition (task C) and vowel recognition (task V) accuracies are illustrated in Fig. 7. No significant effect of α was found either in task C [F(4,24) = 1.76, p = 0.171] or in task V [F(4,24) = 1.98, p = 0.131], although large individual variations were found.
B. Results of Cochlear Implant Users
The experimental results for CI users are shown in Fig. 8. Accuracies of tone recognition in quiet and in noise, including their means and standard deviations over subjects, for the different α are plotted in panels (a) and (b) of Fig. 8. One-way RM-ANOVA showed that the main effect of α was significant for the CI experiments both in quiet [F(4,12) = 25.00, p < 0.001] and in noise [F(4,12) = 12.40, p < 0.001], suggesting that the tone recognition accuracy tended to increase as α increased. Pairwise comparisons showed that, compared with the unmodified condition (i.e., α = 0), positive α
Fig. 7. Means of percent-correct scores of initial consonant recognition
(i.e., task C; left) and final vowel (i.e., task V; right) with vocoder
simulation. The error bars show one standard derivation.
Fig. 8. Means of percent-correct scores of tone recognition (i.e., task T),
initial consonant recognition (i.e., task C), and final vowel recognition
(i.e., task V) with CIs. The error bars show one standard deviation.
a) Mean T scores of four CI users in quiet; b) Mean C and V scores of
four CI users in quiet; c) Mean T scores of four CI users in noise; d) Mean C
and V scores of four CI users in noise. The significance of the pairwise
difference between each L-Tone-modified condition (i.e., α = −4, −2, 2, and
4) and the unmodified condition (i.e., α = 0) is also illustrated by the
asterisk symbols: * 0.01 < p < 0.05, ** 0.005 < p < 0.01, *** p < 0.005.
usually yielded significantly higher recognition accuracy and
negative α usually yielded significantly lower recognition accuracy
(p < 0.05). The only two exceptions were the pair of
α = 2 and 0 (p = 0.55) in quiet and the pair of α = 0
and −2 (p = 0.32) in noise, within neither of which a significant
difference was found. The first exception suggests an
advantage of a strong manipulation (e.g., α = 4) over a weak
manipulation (e.g., α = 2) for CI tone recognition in quiet.
The mean accuracies for consonant recognition (task C) and
vowel recognition (task V) by CI users in quiet and in noise
are illustrated in panels (c) and (d) of Fig. 8. No significant
effect of α was found in task C [F(4,12) = 0.54, p = 0.72]
or task V [F(4,12) = 0.23, p = 0.91] with CI users in
quiet. No significant effect of α was found in task C with
Fig. 9. Tone recognition confusion matrices for vocoder simulation (top),
CIs in quiet (middle), and CIs in noise (bottom) for α = −4 (left), 0 (middle),
and 4 (right) with band condition A. Unit: %.
CI users in noise [F(4,12) = 0.53, p = 0.72]. However, a
significant effect of α was found in task V with CI users in
noise [F(4,12) = 6.86, p = 0.004]. Pairwise comparisons
showed that α = 2 yielded a significantly (p = 0.04) higher
vowel recognition accuracy than α = 0, which could be
explained by a learning effect introduced by an overlooked mistake
in the condition randomization: α = 2 was mostly
conducted after α = 0 in the CI test in noise. In addition, α = 4
yielded a near-significantly (p = 0.09) lower vowel recognition
accuracy than α = 0, which provides pilot evidence for the
advantage of a weak manipulation (e.g., α = 2) over a
strong manipulation (e.g., α = 4) for CI vowel recognition
in noise.
C. Conflicting and Co-varying Effects of Loudness and
Pitch on Tone Recognition
To further illustrate the interacting effects of loudness
contour and pitch contour on Mandarin tone recognition in
the experiments, confusion matrices for the tone recognition
results with α = −4, 0, and 4 under all-band modification
are given in Fig. 9. The condition of α = −4 was designed
for observing the conflicting effects of loudness and pitch,
whereas the condition of α = 4 was designed for observing
their co-varying effects. The three rows from top to bottom
represent vocoder simulations, CIs in quiet, and CIs in noise,
respectively. We can see that 1) the co-varying condition
concentrated the results on the diagonals, that is, more correct
tone identifications were achieved; 2) the conflicting condition
showed the opposite effects, e.g., more Tone 4 (the falling tone)
and Tone 3 (the falling-rising tone) were identified as Tone 2
(the rising tone), more Tone 2 was identified as Tone 4, and
more non-flat tones (i.e., Tones 2, 3, and 4) were identified
as the flat tone (i.e., Tone 1). These results provide some evidence
of the loudness contour's contribution to Mandarin tone
recognition. The co-varying results indicate the effectiveness
of L-Tone for Mandarin tone enhancement in CIs.
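Confusion matrices like those in Fig. 9 are simply per-trial (presented tone, reported tone) tallies, row-normalized to percent. A small sketch (the trial lists below are made up for illustration and do not reproduce the study's responses):

```python
import numpy as np

def tone_confusion(presented, reported, n_tones=4):
    """Row-normalized tone confusion matrix in percent.
    presented/reported: sequences of tone labels 1..n_tones.
    Row i, column j = % of Tone-(i+1) trials reported as Tone (j+1)."""
    m = np.zeros((n_tones, n_tones))
    for p, r in zip(presented, reported):
        m[p - 1, r - 1] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    return 100.0 * m / np.maximum(row_sums, 1)  # guard empty rows

# Illustrative trials: one Tone 2 reported as Tone 4, the kind of
# rising/falling confusion described for the conflicting condition.
presented = [1, 1, 2, 2, 3, 3, 4, 4]
reported  = [1, 1, 2, 4, 3, 3, 4, 4]
cm = tone_confusion(presented, reported)
print(cm)  # diagonal entries are the percent-correct per tone
```

With real data, a co-varying condition (α = 4) would concentrate mass on the diagonal and a conflicting condition (α = −4) would move it off-diagonal, as Fig. 9 shows.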
IV. CONCLUSIONS AND DISCUSSIONS
Amplitude contour (or temporal envelope) was suggested as
a secondary cue for lexical tone recognition in previous studies
[9], [19]–[21], [24]. In this paper, we argue that the effect
of amplitude contour can be explained by the contribution of
loudness contour. In other words, variation in loudness
perception may induce a percept of lexical tone. The
contribution of loudness to tone identification was evaluated
by both vocoder simulation and actual CI experiments using
the proposed L-Tone algorithm. In the vocoder experiment,
three band conditions, i.e., all bands modified, only high-frequency
bands (>1250 Hz) modified, and only low-frequency
bands (<1250 Hz) modified, were applied
to examine the band effect on the effectiveness of L-Tone.
Results suggest that the high-frequency components make a
relatively greater contribution to the effectiveness of L-Tone
than the low-frequency components. Nevertheless, simultaneously
manipulating all frequency bands resulted in
better performance than manipulating only the high-frequency
bands. Therefore, full-band manipulation is suggested to fully
exploit the benefits of L-Tone.
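The three band conditions amount to applying the time-varying L-Tone gain to all components (A), only those above 1250 Hz (H), or only those below (L). A minimal sketch of such band-selective gain application, assuming a simple Butterworth crossover rather than the paper's Gammatone filterbank (the filter order and the linear-ramp gain signal are illustrative stand-ins):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def apply_band_gain(x, gain, fs, band="A", fc=1250.0):
    """Apply a per-sample gain to all ('A'), only high-frequency ('H'),
    or only low-frequency ('L') components, split at fc Hz."""
    if band == "A":
        return x * gain
    sos_lo = butter(4, fc, btype="low", fs=fs, output="sos")
    sos_hi = butter(4, fc, btype="high", fs=fs, output="sos")
    lo = sosfiltfilt(sos_lo, x)   # zero-phase split below fc
    hi = sosfiltfilt(sos_hi, x)   # zero-phase split above fc
    if band == "H":
        return lo + hi * gain     # leave low band untouched
    return lo * gain + hi         # band == "L"

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
gain = np.linspace(0.5, 2.0, x.size)  # stand-in for an F0-derived gain
y = apply_band_gain(x, gain, fs, band="H")
```

In a real-time CI strategy the zero-phase `sosfiltfilt` split would be replaced by the strategy's own causal filterbank, as discussed below for the implementation.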
Using L-Tone with positive α, i.e., the co-varying condition
of loudness and pitch, the perceptual tests in both simulated
and actual CI hearing showed significant improvement
in tone recognition. For CIs in quiet, significant gains
in tone recognition accuracy were achieved by the strong
manipulation (i.e., α = 4), but not by the weak manipulation
(i.e., α = 2). For CIs in noise, significant gains in tone
recognition accuracy were achieved by both the weak and
strong manipulations. However, the strong manipulation is not
recommended due to its negative effect on vowel recognition,
as shown in Sec. III-B. These results indicate that the α
value should be adjusted according to the SNR condition
when applying L-Tone in CIs.
Although promising results have been obtained, this study
has some limitations. Firstly, the number of CI subjects was
relatively small and individual variations were
large. Secondly, the effects of prelingual versus postlingual
deafness, years of CI experience, and age could not be
considered due to the insufficient number of subjects, and more work
needs to be done to optimize L-Tone for different types of CI
users. Thirdly, the intelligibility tasks (i.e., consonant and
vowel recognition) using four-alternative forced-choice trials
were easy for the CI users (especially for the CIs in quiet),
so the possible side effects of L-Tone on speech intelligibility
could not be fully revealed.
In both the vocoder simulation and the actual CI
experiments, the subjects sometimes identified the tones according
to the loudness contour rather than the original pitch contour,
especially in the conflicting condition (e.g., α = −4), as
is well illustrated in Fig. 9. This suggests that, with
stimuli from both vocoders and CIs, where the primary pitch cues
are inadequately transmitted, the secondary contribution of the
loudness cue to tone recognition can be revealed. In the future,
L-Tone can be used to study the interaction between pitch
and loudness cues in lexical tone recognition and other voice
pitch perception tasks (e.g., intonation recognition). Although
our experiments were carried out using Mandarin materials,
L-Tone may also be applicable to other tonal languages. It should
be noted that Mandarin tone contrasts differ mainly in the
F0 contour, but several languages (e.g., Cantonese) use both
F0 contour and register. Whether loudness contrasts between
Cantonese words can be identified as tone contrasts, and if so,
how to define the F0-gain function, requires further study.
As for real-time implementation, L-Tone can be directly
incorporated into CI strategies. The Gammatone filterbank
can be substituted by the default filterbank of the CI strategy.
The “min” operation in (1), which is a non-causal operation,
should be substituted by a constant value, e.g., 50 Hz. Additionally,
the loudness manipulation can be performed in the
pre-processing stage or after the bandpass filtering. L-Tone
could also be used for rehabilitation training. For example,
it can be incorporated into computer-aided speech training
software to manipulate the sound materials. The loudness
manipulation factor α can be gradually adjusted during
training from a high positive value towards zero, or to a negative
value if necessary. L-Tone with positive α can exaggerate
the contrast between different tones and consequently make the
contour cues easier to identify. L-Tone with negative α can
degrade the original loudness contour so that pitch contour
discrimination ability can be examined without the influence
of the loudness contour. This method may be useful for tone
recognition training, especially in the first several months of
speech training after implant activation.
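The causal variant suggested above, replacing the non-causal "min" in (1) with a fixed reference such as 50 Hz, can be sketched as follows. The log-domain gain–F0 mapping below is a hypothetical form chosen for illustration, not the paper's Eq. (1); the function name and parameters are likewise assumptions:

```python
import numpy as np

def ltone_gain_causal(f0_track, alpha, f0_ref=50.0):
    """Frame-by-frame L-Tone-style gain from an F0 track (Hz).
    Hypothetical mapping: alpha * log2(F0 / f0_ref) dB on voiced
    frames; unity gain (0 dB) where F0 is absent (unvoiced frames,
    coded as 0). Using a fixed f0_ref keeps the operation causal."""
    f0 = np.asarray(f0_track, dtype=float)
    gain_db = np.where(f0 > 0,
                       alpha * np.log2(np.maximum(f0, f0_ref) / f0_ref),
                       0.0)
    return 10.0 ** (gain_db / 20.0)

# A rising, Tone-2-like F0 track: with positive alpha the gain rises
# too, so the loudness contour co-varies with the pitch contour.
f0 = np.linspace(120, 240, 100)
g = ltone_gain_causal(f0, alpha=4)
print(g[0], g[-1])  # gain grows with F0 for alpha > 0
```

With negative α the same track produces a falling gain, i.e., the conflicting condition; for training software, α would simply be stepped down over sessions as described above.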
ACKNOWLEDGMENT
The authors would like to thank all subjects who partici-
pated in this study. They are grateful to X. Zhang, F. Qiu,
T. Heng, H. Zeng, and L. Wang from Shenzhen Univ., Y. Cai
from Sun Yat-sen Univ., G. Yu and D. Rao from SCUT, L. Yin,
L. Ping, and G. Tang from Nurotron Company, H. Mou from
Chinese Academy of Sciences, and T. Guan and S. Liu from
Tsinghua Univ. for their help during the experiments.
REFERENCES
[1] A. Li, N. Wang, J. Li, J. Zhang, and Z. Liu, “Mandarin lexical tones
identification among children with cochlear implants or hearing aids,”
Int. J. Pediatric Otorhinolaryngol., vol. 78, no. 11, pp. 1945–1952, 2014.
[2] C.-M. Wu, T.-C. Liu, N.-M. Wang, and W.-C. Chao, “Speech perception
and communication ability over the telephone by Mandarin-speaking
children with cochlear implants,” Int. J. Pediatric Otorhinolaryngol.,
vol. 77, no. 8, pp. 1295–1302, 2013.
[3] B. S. Wilson et al., “Better speech recognition with cochlear implants,”
Nature, vol. 352, pp. 236–238, Jul. 1991.
[4] P. Loizou, “Speech processing in vocoder-centric cochlear implants,”
Adv. Otorhinolaryngol., vol. 64, pp. 109–143, 2006.
[5] D. Han et al., “Lexical tone perception with HiResolution and HiResolution
120 sound-processing strategies in pediatric Mandarin-speaking cochlear
implant users,” Ear Hear., vol. 30, no. 2, pp. 169–177, 2009.
[6] W. Wang, N. Zhou, and L. Xu, “Musical pitch and lexical tone percep-
tion with cochlear implants,” Int. J. Audiol., vol. 50, no. 4, pp. 270–278,
2011.
[7] V. Ciocca, A. L. Francis, R. Aisha, and L. Wong, “The perception
of Cantonese lexical tones by early-deafened cochlear implantees,”
J. Acoust. Soc. Am., vol. 111, no. 5, pp. 2250–2256, 2002.
[8] C.-G. Wei, K. Cao, and F.-G. Zeng, “Mandarin tone recognition in
cochlear-implant subjects,” Hear. Res., vol. 197, nos. 1–2, pp. 87–95,
2004.
[9] Q.-J. Fu, F.-G. Zeng, R. V. Shannon, and S. D. Soli, “Importance of
tonal envelope cues in Chinese speech recognition,” J. Acoust. Soc. Am.,
vol. 104, no. 1, pp. 505–510, 1998.
[10] L. Geurts and J. Wouters, “Coding of the fundamental frequency
in continuous interleaved sampling processors for cochlear implants,”
J. Acoust. Soc. Am., vol. 109, no. 2, pp. 713–726, 2001.
[11] T. Green, A. Faulkner, S. Rosen, and O. Macherey, “Enhancement of
temporal periodicity cues in cochlear implants: Effects on prosodic
perception and vowel identification,” J. Acoust. Soc. Am., vol. 118, no. 1,
pp. 375–385, 2005.
[12] T. Green, A. Faulkner, and S. Rosen, “Enhancing temporal cues to voice
pitch in continuous interleaved sampling cochlear implants,” J. Acoust.
Soc. Am., vol. 116, no. 4, pp. 2298–2310, 2004.
[13] M. Milczynski, J. E. Chang, J. Wouters, and A. van Wieringen,
“Perception of Mandarin Chinese with cochlear implants using enhanced
temporal pitch cues,” Hear. Res., vol. 285, nos. 1–2, pp. 1–12, 2012.
[14] T. Lee, S. Yu, M. Yuan, T. K. C. Wong, and Y.-Y. Kong, “The effect of
enhancing temporal periodicity cues on Cantonese tone recognition by
cochlear implantees,” Int. J. Audiol., vol. 53, no. 8, pp. 546–557, 2014.
[15] A. E. Vandali and R. J. M. van Hoesel, “Development of a temporal
fundamental frequency coding strategy for cochlear implants,” J. Acoust.
Soc. Am., vol. 129, no. 6, pp. 4023–4036, 2011.
[16] X. Li et al., “Improved perception of speech in noise and Mandarin tones
with acoustic simulations of harmonic coding for cochlear implants,”
J. Acoust. Soc. Am., vol. 132, no. 5, pp. 3387–3398, 2012.
[17] X. Li, K. Nie, N. S. Imennov, J. T. Rubinstein, and L. E. Atlas,
“Improved perception of music with a harmonic based algorithm for
cochlear implants,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 21,
no. 4, pp. 684–694, Jul. 2013.
[18] Q. Meng, N. Zheng, and X. Li, “Mandarin speech-in-noise and tone
recognition using vocoder simulations of the temporal limits encoder for
cochlear implants,” J. Acoust. Soc. Am., vol. 139, no. 1, pp. 301–310,
2016.
[19] D. H. Whalen and Y. Xu, “Information for Mandarin tones in the
amplitude contour and in brief segments,” Phonetica, vol. 49, no. 1,
pp. 25–47, 1992.
[20] Q.-J. Fu and F.-G. Zeng, “Identification of temporal envelope cues in
Chinese tone recognition,” Asia Pacific J. Speech, Lang. Hear., vol. 5,
no. 1, pp. 45–57, 2000.
[21] Y.-Y. Kong and F.-G. Zeng, “Temporal and spectral cues in Mandarin
tone recognition,” J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2830–2840,
2006.
[22] L. Xu, Y. Tsai, and B. E. Pfingst, “Features of stimulation affecting tonal-
speech perception: Implications for cochlear prostheses,” J. Acoust. Soc.
Am., vol. 112, no. 1, pp. 247–258, 2002.
[23] L. Xu and B. E. Pfingst, “Relative importance of temporal envelope
and fine structure in lexical-tone perception (L),” J. Acoust. Soc. Am.,
vol. 114, no. 6, pp. 3024–3027, 2003.
[24] X. Luo and Q.-J. Fu, “Enhancing Chinese tone recognition by manipulat-
ing amplitude envelope: Implications for cochlear implants,” J. Acoust.
Soc. Am., vol. 116, no. 6, pp. 3659–3667, 2004.
[25] S. Duanmu, The Phonology of Standard Chinese. London, U.K.: Oxford
Univ. Press, 2002, pp. 225–253.
[26] J. Plantinga and L. J. Trainor, “Memory for melody: Infants use a relative
pitch code,” Cognition, vol. 98, no. 1, pp. 1–11, 2005.
[27] J. H. McDermott, A. J. Lehr, and A. J. Oxenham, “Is relative pitch
specific to pitch?” Psychol. Sci., vol. 19, no. 12, pp. 1263–1271, 2008.
[28] M. Cousineau, L. Demany, B. Meyer, and D. Pressnitzer, “What breaks a
melody: Perceiving F0 and intensity sequences with a cochlear implant,”
Hear. Res., vol. 269, nos. 1–2, pp. 34–41, 2010.
[29] X. Luo, M. E. Masterson, and C.-C. Wu, “Contour identification with
pitch and loudness cues using cochlear implants,” J. Acoust. Soc. Am.,
vol. 135, no. 1, pp. EL8–EL14, 2014.
[30] L. Cabrera, F.-M. Tsao, D. Gnansia, J. Bertoncini, and C. Lorenzi, “The
role of spectro-temporal fine structure cues in lexical-tone discrimination
for French and Mandarin listeners,” J. Acoust. Soc. Am., vol. 136, no. 2,
pp. 877–882, 2014.
[31] L. Cabrera et al., “The perception of speech modulation cues in lexical
tones is guided by early language-specific experience,” Front. Psychol.,
vol. 6, p. 1290, 2015.
[32] M. Brookes. (2014). Voicebox: Speech Processing Toolbox for
MATLAB. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/
voicebox/voicebox.html
[33] P. Boersma and D. Weenink. (2014). Praat: Doing Phonetics by
Computer (Version 5.3.79). [Online]. Available: http://www.praat.org/
[34] R. V. Shannon, F.-G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid,
“Speech recognition with primarily temporal cues,” Science, vol. 270,
no. 5234, pp. 303–304, 1995.
[35] A. E. Vandali, L. A. Whitford, K. L. Plant, and G. M. Clark, “Speech
perception as a function of electrical stimulation rate: Using the nucleus
24 cochlear implant system,” Ear Hear., vol. 21, no. 6, pp. 608–624,
2000.
[36] F. G. Zeng et al., “Development and evaluation of the nurotron
26-electrode cochlear implant system,” Hear. Res., vol. 322,
pp. 188–199, Apr. 2015.
[37] D. Riss et al., “FS4, FS4-p, and FSP: A 4-month crossover study of
3 fine structure sound-coding strategies,” Ear Hear., vol. 35, no. 6,
pp. e272–e281, 2014.
Qinglin Meng received the B.S. degree in electronic information science
from Harbin Engineering University, Harbin, China, in 2008, and the
Ph.D. degree in signal processing from the Institute of Acoustics,
Chinese Academy of Sciences, Beijing, China, in 2013.
From 2013 to 2016, he was a postdoctoral researcher at the College
of Information Engineering, Shenzhen University, China. He is currently
a lecturer at the School of Physics and Optoelectronics, South China
University of Technology, China. His research focuses on cochlear implant
technology, psychoacoustics, and physiological acoustics.
Nengheng Zheng (M’06) received the
B.S. degree in electronic engineering and
the M.S. degree in acoustics from Nanjing
University, Nanjing, China, in 1997 and 2002,
respectively, and the Ph.D. degree in electronic
engineering from the Chinese University of
Hong Kong, Hong Kong SAR of China, in 2006.
He is currently an Associate Professor at the
College of Information Engineering, Shenzhen
University, China. From 2014 to 2015, he was
a visiting scholar at the School of Electrical
Engineering and Telecommunications, University of New South Wales,
Australia. His research focuses on speech and audio signal processing
for human and machine perceptions.
Xia Li (M’08) was born in 1968. She received the
B.S. degree in electronics engineering and the
M.S. degree in signal and information processing
from Xidian University, Xi’an, China, and the
Ph.D. degree in information engineering from the
Chinese University of Hong Kong, in 1997.
She is currently a Professor with the College
of Information Engineering, Shenzhen University,
China. Her current research interests include
computational intelligence, image processing,
and pattern recognition.
... C OCHLEAR implants (CIs) have been relatively successful in enabling most of their users to achieve good performance in speech perception in quiet [1]. However, CI users are still experiencing poor pitch perception, and thus are still struggling in other listening tasks such as speech-innoise perception [2], music perception [3], speech intonation perception [4], and lexical tone perception in tonal languages [5]. In current CIs, input sounds are divided into sub-band channels. ...
... Then, the strategy processing was applied to get the electrodogram, and a spectral centroid was calculated using Eq. (5). The centroid distance between the lower and upper mean centroids was labeled in the number besides each line segment. ...
Article
Full-text available
The temporal-limits-encoder (TLE) strategy has been proposed to enhance the representation of temporal fine structure (TFS) in cochlear implants (CIs), which is vital for many aspects of sound perception but is typically discarded by most modern CI strategies. TLE works by computing an envelope modulator that is within the temporal pitch limits of CI electric hearing. This paper examines the TFS information encoded by TLE and evaluates the salience and usefulness of this information in CI users. Two experiments were conducted to compare pitch perception performance of TLE versus the widely-used Advanced Combinational Encoder (ACE) strategy. Experiment 1 investigated whether TLE processing improved pitch discrimination compared to ACE. Experiment 2 parametrically examined the effect of changing the lower frequency limit of the TLE modulator on pitch ranking. In both experiments, F0 difference limens were measured with synthetic harmonic complex tones using an adaptive procedure. Signal analysis of the outputs of TLE and ACE strategies showed that TLE introduces important temporal pitch cues that are not available with ACE. Results showed an improvement in pitch discrimination with TLE when the acoustic input had a lower F0 frequency. No significant effect of lower frequency limit was observed for pitch ranking, though a lower limit did tend to provide better outcomes. These results suggest that the envelope modulation introduced by TLE can improve pitch perception for CI listeners.
... Disyllabic words from a Mandarin-tone corpus [20] designed to highlight the use of pitch cue in lexical tone perception were used. In natural speech, besides pitch contour as the primary cue for lexical tone perception, there are also some secondary cues such as the loudness contour and duration [20,21,22,23]. In the corpus used in this study, the effects of the secondary cues are eliminated by manipulating the loudness and pitch contour of the syllables. ...
... Vocoder simulation experiments have been reported in a very large number of papers (e.g. see Kong et al. [3] and Table I of Stone et al. [4]), to investigate numerous parameters of CI design and use, including channel number [5,6], modulation rate [7], intensity resolution [8], electrode insertion depth [9,10] and frequency allocation [11,12], spatial hearing [13,14], fundamental frequency discrimination [15,16] and its contribution to speech segregation [17,18], lexical tone perception [19,20,21], natural sound distortion perception [22,23], or the evaluation of novel envelope-based strategies [24,25]. A demonstration can be found in [26]. ...
... Different weighting strategies were found in four CI participants, in that two participants relied more on loudness cues, and the other two participants relied more on pitch cues. The influence of loudness (or amplitude) contour on CI tone recognition has been demonstrated in several studies (Luo and Fu, 2004;Meng et al., 2016Meng et al., , 2018Ping et al., 2017;Kim et al., 2021). ...
Article
Full-text available
Despite pitch being considered the primary cue for discriminating lexical tones, there are secondary cues such as loudness contour and duration, which may allow some cochlear implant (CI) tone discrimination even with severely degraded pitch cues. To isolate pitch cues from other cues, we developed a new disyllabic word stimulus set (Di) whose primary (pitch) and secondary (loudness) cue varied independently. This Di set consists of 270 disyllabic words, each having a distinct meaning depending on the perceived tone. Thus, listeners who hear the primary pitch cue clearly may hear a different meaning from listeners who struggle with the pitch cue and must rely on the secondary loudness contour. A lexical tone recognition experiment was conducted, which compared Di with a monosyllabic set of natural recordings. Seventeen CI users and eight normal-hearing (NH) listeners took part in the experiment. Results showed that CI users had poorer pitch cues encoding and their tone recognition performance was significantly influenced by the “missing” or “confusing” secondary cues with the Di corpus. The pitch-contour-based tone recognition is still far from satisfactory for CI users compared to NH listeners, even if some appear to integrate multiple cues to achieve high scores. This disyllabic corpus could be used to examine the performance of pitch recognition of CI users and the effectiveness of pitch cue enhancement based Mandarin tone enhancement strategies. The Di corpus is freely available online: https://github.com/BetterCI/DiTone.
... 29 I. INTRODUCTION 30 C OCHLEAR implants (CIs) have been relatively success-31 ful in enabling most of their users to achieve good per-32 formance in speech perception in quiet [1]. However, CI users 33 are still experiencing poor pitch perception, and thus are still 34 struggling in other listening tasks such as speech-in-noise 35 perception [2], music perception [3], speech intonation per-36 ception [4], and lexical tone perception in tonal languages [5]. 37 In current CIs, input sounds are divided into sub-band chan-38 nels. ...
Article
No PDF available ABSTRACT To enhance the mostly discarded temporal fine structure in modern cochlear implant (CI) strategies, a temporal-limits-encoder (TLE) strategy was proposed by downshifting the high-frequency-band-limited signal to a low-frequency-temporal-pitch-limits range of CIs [Meng et al., J. Acoust. Soc. Am.(2016)]. This study investigates pitch perception with a TLE strategy compared with a standard advanced combinational encoder (ACE) strategy. Seven CI subjects were tested in a complex-tone-pitch discrimination task measuring the fundamental frequency difference limens (F0DLs) at four reference F0s (250, 313, 1000, and 1063 Hz, which are the center and upper cross-over frequencies of two bands). Results show that (1) the CI listeners generally had lower F0DLs with TLE than with ACE (group mean F0DL benefits of TLE over ACE of 5.0, 9.6, 0.5 and 4.3 percentage points at the four reference F0s, respectively) and (2) the two strategies had comparable sentence recognition performance in both quiet and noisy conditions. These findings suggest that the slowly varying TFS introduced by TLE is feasible in pitch discrimination for CI listeners and is not significantly detrimental to sentence recognition. This discrimination advantage can be explained by larger differences in the temporal fluctuations on individual channels with TLE than with ACE.
Article
Segregation and integration are two fundamental yet competing computations in cognition. For example, in serial speech processing, stable perception necessitates the sequential establishment of perceptual representations to remove irrelevant features for achieving invariance. Whereas multiple features need to combine to create a coherent percept. How to simultaneously achieve seemingly contradicted computations of segregation and integration in a serial process is unclear. To investigate their neural mechanisms, we used loudness and lexical tones as a research model and employed a novel multilevel oddball paradigm with Electroencephalogram (EEG) recordings to explore the dynamics of mismatch negativity (MMN) responses to their deviants. When two types of deviants were presented separately, distinct topographies of MMNs to loudness and tones were observed at different latencies (loudness earlier), supporting the sequential dynamics of independent representations for two features. When they changed simultaneously, the latency of responses to tones became shorter and aligned with that to loudness, while the topographies remained independent, yielding the combined MMN as a linear additive of single MMNs of loudness and tones. These results suggest that neural dynamics can be temporally synchronized to distinct sensory features and balance the computational demands of segregation and integration, grounding for invariance and feature binding in serial processing.
Article
To investigate Mandarin Tone 2 production of disyllabic words of prelingually deafened children with a cochlear implant (CI) and a contralateral hearing aid (HA) and to evaluate the relationship between their demographic variables and tone-production ability. Thirty prelingually Mandarin-speaking preschoolers with CI+HA and 30 age-matched normal-hearing (NH) children participated in the study. Fourteen disyllabic words were recorded from each child. A total of 840 tokens (14 × 60) were then used in tone-perception tests in which four speech therapists participated. The production of T2-related disyllabic words of the bimodal group was significantly worse than that of the NH group, as reflected in the overall accuracy (88.57% ± 16.31% vs 99.29% ± 21.79%, p < 0.05), the accuracy of T1+T2 (93.33% vs 100%), the accuracy of T2+T1 (66.67 ± 37.91% vs 98.33 ± 9.13%), and the accuracy of T2+T4 (78.33 ± 33.95% vs 100%). In addition, the bimodal group showed significantly inferior production accuracy of T2+T1 than T2+T2 and T3+T2, p < 0.05. Both bimodal age and implantation age were significantly negatively correlated with the overall production accuracy, p < 0.05. For the error patterns, bimodal participants experienced more errors when T2 was in the first position of the tone combination, and T2 was most likely to be mispronounced as T1 and T3. Bimodal patients aged 3-5 have T2-related disyllabic lexical tone production defects, and their performances are related to tone combination, implantation age, and bimodal age.
Article
Conventional cochlear implants using periodic sampling are power consuming and incapable of capturing the amplitude and phase of the input acoustic signal simultaneously. This paper presents an asynchronous event-driven encoder chip for cochlear implants capable of extracting the temporal fine structure. The chip architecture is based on asynchronous delta modulation (ADM) where the signal peak/trough crossing events are captured and digitized intrinsically, which has the advantages of significantly reduced power consumption, reduced circuit area, and the elimination of dedicated data compression circuitry. An 8-channel prototype chip was fabricated in 0.18 μm 1P6M CMOS process, occupying an area of 0.15×1.7mm20.15 \times 1.7 mm^2 and has a power consumption of 36.2 μW from a 0.6V supply. A 16-channel stimulation encoding system was built by integrating two test chips, capable of processing the entire audible frequency range from 100 Hz to 10 kHz. Experimental characterization using the human voice is provided to corroborate functionality in the application environment.
Article
Full-text available
Temporal envelope-based signal processing strategies are widely used in cochlear-implant(CI) systems. It is well recognized that the inability to convey temporal fine structure (TFS) in the stimuli limits CI users&apos; performance, but it is still unclear how to effectively deliver the TFS. A strategy known as the temporal limits encoder (TLE), which employs an approach to derive the amplitude modulator to generate the stimuli coded in an interleaved-sampling strategy, has recently been proposed. The TLE modulator contains information related to the original temporal envelope and a slow-varying TFS from the band signal. In this paper, theoretical analyses are presented to demonstrate the superiority of TLE compared with two existing strategies, the clinically available continuous-interleaved-sampling (CIS) strategy and the experimental harmonic-single-sideband-encoder strategy. Perceptual experiments with vocoder simulations in normal-hearing listeners are conducted to compare the performance of TLE and CIS on two tasks (i.e., Mandarin speech reception in babble noise and tone recognition in quiet). The performance of the TLE modulator is mostly better than (for most tone-band vocoders) or comparable to (for noise-band vocoders) the CIS modulator on both tasks. This work implies that there is some potential for improving the representation of TFS with CIs by using a TLE strategy.
Article
Full-text available
A number of studies showed that infants reorganize their perception of speech sounds according to their native language categories during their first year of life. Still, information is lacking about the contribution of basic auditory mechanisms to this process. This study aimed to evaluate when native language experience starts to noticeably affect the perceptual processing of basic acoustic cues [i.e., frequency-modulation (FM) and amplitude-modulation information] known to be crucial for speech perception in adults. The discrimination of a lexical-tone contrast (rising versus low) was assessed in 6- and 10-month-old infants learning either French or Mandarin using a visual habituation paradigm. The lexical tones were presented in two conditions designed to either keep intact or to severely degrade the FM and fine spectral cues needed to accurately perceive voice-pitch trajectory. A third condition was designed to assess the discrimination of the same voice-pitch trajectories using click trains containing only the FM cues related to the fundamental-frequency (F0) in French- and Mandarin-learning 10-month-old infants. Results showed that the younger infants of both language groups and the Mandarin-learning 10-month-olds discriminated the intact lexical-tone contrast while French-learning 10-month-olds failed. However, only the French 10-month-olds discriminated degraded lexical tones when FM, and thus voice-pitch cues were reduced. Moreover, Mandarin-learning 10-month-olds were found to discriminate the pitch trajectories as presented in click trains better than French infants. Altogether, these results reveal that the perceptual reorganization occurring during the first year of life for lexical tones is coupled with changes in the auditory ability to use speech modulation cues.
Article
For tonal languages such as Mandarin Chinese, tone recognition is important for understanding the meaning of words, phrases or sentences. While fundamental frequency carries the most distinctive information for tone recognition, waveform temporal envelope cues can also produce a high level of tone recognition. This study attempts to identify what types of temporal envelope cues contribute to tone recognition and whether these temporal envelope cues are dependent on speakers and vowel contexts. Several signal-correlated-noise stimuli were generated to separate the contribution of three major temporal envelope cues – duration, amplitude contour, and periodicity – to tone recognition. Perceptual results show that the duration cue contributed mostly to discrimination of Tone-3, the amplitude cue contributed mostly to Tone-3 and Tone-4 discrimination, and the periodicity cue contributed to recognition of all tones. However, tone recognition based on temporal envelope cues was highly variable across speakers and vowel contexts. Acoustic analysis of these temporal envelope cues revealed that this variability in tone recognition is directly related to the acoustic variability between the amplitude contour and fundamental frequency contour.
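Signal-correlated noise is commonly generated by randomly inverting the sign of individual samples (the classic Schroeder recipe; the study may use a variant), which keeps duration and amplitude contour intact sample-by-sample while destroying periodicity and fine structure. A minimal sketch:

```python
import numpy as np

def signal_correlated_noise(signal, seed=None):
    """Schroeder-type signal-correlated noise (assumed recipe): flip the
    sign of each sample at random. The temporal envelope and duration of
    the input are preserved exactly; periodicity/fine structure is not."""
    rng = np.random.default_rng(seed)
    flips = rng.choice(np.array([-1.0, 1.0]), size=len(signal))
    return signal * flips
```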
Article
Although the cochlear implant has been widely acknowledged as the most successful neural prosthesis, only a fraction of hearing-impaired people who could potentially benefit from a cochlear implant have actually received one, due to limited awareness, accessibility, and affordability. To help overcome these limitations, a 26-electrode cochlear implant was developed, receiving China’s Food and Drug Administration (CFDA) approval in 2011 and the Conformité Européenne (CE) Mark in 2012. The present article describes the design philosophy, system specification, and technical verification of the Nurotron device, which includes advanced digital signal processing and 4 current sources with multiple amplitude resolutions that not only are compatible with perceptual capability but also allow interleaved or simultaneous stimulation. The article also presents 3-year longitudinal evaluation data from 60 human subjects who received the Nurotron device. The objective measures show that electrode impedance decreased within the first month of device use and then remained stable until a slight increase at the end of two years. The subjective loudness measures show that the electric stimulation threshold was stable while the maximal comfort level increased over the 3 years. Mandarin sentence recognition increased from the pre-surgical 0%-correct score to a plateau of about 80% correct after 6 months of device use. Both indirect and direct comparisons indicate indistinguishable performance differences between the Nurotron system and other commercially available devices. The present 26-electrode cochlear implant has already helped to lower the price of cochlear implantation in China and will likely contribute to increased cochlear implant access and success in the rest of the world.
Article
The role of spectro-temporal modulation cues in conveying tonal information for lexical tones was assessed in native-Mandarin and native-French adult listeners using a lexical-tone discrimination task. The fundamental frequency (F0) of Thai tones was either degraded using an 8-band vocoder that reduced fine spectral details and frequency-modulation cues, or extracted and used to modulate the F0 of click trains. Mandarin listeners scored lower than French listeners in the discrimination of vocoded lexical tones. For click trains, Mandarin listeners outperformed French listeners. These preliminary results suggest that the perceptual weight of the fine spectro-temporal modulation cues conveying F0 information is enhanced for adults speaking a tonal language.
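A click train carrying only an F0 trajectory can be sketched as follows (an assumed construction for illustration, not necessarily the authors' exact stimulus generation): integrate the instantaneous F0 into a running phase and emit a unit click each time the phase completes another cycle.

```python
import numpy as np

def f0_click_train(f0_contour, fs):
    """Pulse train whose instantaneous rate follows an F0 contour (Hz,
    one value per sample). A click is placed at each sample where the
    accumulated phase (the integral of F0) crosses an integer cycle."""
    phase = np.cumsum(np.asarray(f0_contour, dtype=float)) / fs
    clicks = np.zeros(len(phase))
    cycle_idx = np.searchsorted(phase, np.arange(1, int(phase[-1]) + 1))
    clicks[cycle_idx] = 1.0
    return clicks
```

A constant 100-Hz contour yields clicks every fs/100 samples; a rising contour yields progressively shorter inter-click intervals, conveying the pitch trajectory without any spectral speech cues.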
Article
Temporal cues contribute significantly to Mandarin tone recognition, but the relevance of formant frequencies is debatable. This study investigates the relative contributions of temporal and formant cues to tone recognition. Three sets of Mandarin stimuli were created: recorded whispered speech was used to test the contribution of the formant cue, 1- and 8-band noise-modulated speech with a 50-Hz envelope cutoff frequency was used to test the contribution of the temporal envelope cue, and 1- and 8-band speech with a 500-Hz cutoff frequency was used to test the contribution of the periodicity cue. Four normal-hearing native Mandarin speakers participated. In quiet, subjects achieved an average of 87% (8-band) and 82% (1-band) correct with the periodicity cue (500-Hz cutoff), but only 72% (8-band) and 55% (1-band) correct without the periodicity cue (50-Hz cutoff). Whispered speech produced an average of 72% correct. From 10 to -10 dB signal-to-noise ratio, this pattern of results was largely preserved: the whispered-speech and 8-band conditions without the periodicity cue showed similar performance, significantly better than the 1-band condition without the periodicity cue but poorer than the 8-band condition with the periodicity cue. These results suggest that all three cues contribute to Mandarin tone recognition and that there is a trade-off among them.
Article
Objectives: Mandarin Chinese is a lexical tone language with four tones, where a change in tone denotes a change in lexical meaning. There are few studies regarding lexical tone identification abilities in deafened children using either cochlear implants (CIs) or hearing aids (HAs). Furthermore, no study has compared the lexical tone identification abilities of deafened children with their hearing devices turned on and off. The present study aimed to investigate the lexical tone identification abilities of deafened children with CIs or HAs. Methods: Forty prelingually deafened children (20 with CIs and 20 with HAs) participated in the study. In the HA group, all 20 children were binaurally aided. In the CI group, all of the children were unilaterally implanted. All of the subjects completed a computerized lexical tone pairs test with their hearing devices turned on and off. The correct answers across all items were recorded as the total score, and the correct answers for each tone pair were recorded as subtotal scores. Results: No significant differences in the tone pair identification scores were found between the CI group and the HA group with the devices turned either on or off (t=1.62, p=0.11; t=1.863, p=0.07, respectively). The scores in the aided condition were higher than in the unaided condition regardless of the device used (t=22.09, p<0.001, in the HA group; t=20.20, p<0.001, in the CI group). Significantly higher scores were found for the tone pairs that contained tone 4. Age at fitting of the devices was correlated with tone identification abilities in both the CI and HA groups. Other demographic factors were not correlated with tone identification ability. Conclusions: The hearing device, whether a hearing aid or cochlear implant, is beneficial for tone identification. The lexical tone identification abilities were similar regardless of whether the subjects wore a HA or CI. Lexical tone pairs with different durations and dissimilar tone contour patterns are more easily identified. Receiving devices at an earlier age tends to produce better lexical tone identification abilities in prelingually deafened children.
Article
Objectives: The aim of the present study was to compare two novel fine structure strategies, "FS4" and "FS4-p", with the established fine structure processing (FSP) strategy. FS4 provides fine structure information on the four apical electrode channels. With FS4-p, these electrodes may be stimulated in a parallel manner. The authors evaluated speech perception, sound quality, and subjective preference. Design: A longitudinal crossover study was conducted with postlingually deafened adults (N = 33) who were using FSP as their default strategy. Each participant was fitted with FS4, FS4-p, and FSP for 4 months each, in a randomized and blinded order. After each run, an adaptive sentence test in noise (Oldenburger Sentence Test [OLSA]) and a monosyllable test in quiet (Freiburger Monosyllables) were performed, and subjective sound quality was determined with a visual analogue scale. At the end of the study the preferred strategy was noted. Results: Scores on the OLSA did not reveal any significant differences among the three strategies, but the Freiburger test showed a statistically significant effect (p = 0.03), with slightly worse scores for FS4 (49.7%) than for FSP (54.3%). Performance with FS4-p (51.8%) was comparable to the other strategies. Both audiometric tests showed high variability among subjects. The number of participants for whom each strategy performed best was as follows: (a) for the OLSA: FSP, N = 10.5; FS4, N = 10.5; and FS4-p, N = 12; and (b) for the Freiburger test: FSP, N = 14; FS4, N = 9; and FS4-p, N = 10. A moderate agreement was found between the best-performing strategies of the two speech tests within participants. For sound quality, speech in quiet, classical music, and pop music were assessed. No significant effects of strategy were found for speech in quiet and classical music, but the auditory impression of pop music was rated as more natural with FSP than with FS4 (p = 0.04). Interestingly, at the end of the study a majority of the participants favored the new coding strategies over their previous default FSP (FSP, N = 13; FS4, N = 13; FS4-p, N = 7). Conclusions: In summary, FS4 and FS4-p offer new options in audio processor fitting, with levels of speech understanding in noise similar to FSP. This is an interesting result, given that the strategies' presentation of temporal fine structure differs from FSP. At the end of the study, 20 of 33 subjects chose either FS4 or FS4-p over their previous default strategy FSP.
Article
Objectives: This study investigates the efficacy of a cochlear implant (CI) processing method that enhances the temporal periodicity cues of speech. Design: Subjects participated in word and tone identification tasks. Two processing conditions were tested: the conventional advanced combination encoder (ACE) and tone-enhanced ACE. Test materials were Cantonese disyllabic words recorded from one male and one female speaker. Speech-shaped noise was added to clean speech. The fundamental frequency information for periodicity enhancement was extracted from the clean speech. Electrical stimuli generated from the noisy speech with and without periodicity enhancement were presented via direct stimulation using a Laura 34 research processor. Subjects were asked to identify the presented word. Study sample: Seven post-lingually deafened native Cantonese-speaking CI users. Results: Percent correct word, segmental structure, and tone identification scores were calculated. While word and segmental structure identification accuracy remained similar between the two processing conditions, tone identification in noise was better with tone-enhanced ACE than with conventional ACE. A significant improvement in tone perception was found only for the female voice. Conclusions: Temporal periodicity cues are important for tone perception in noise. Pitch and tone perception by CI users could be improved when listeners receive enhanced temporal periodicity cues.
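One simple way to enhance temporal periodicity cues, in the spirit of tone-enhanced ACE, is to superimpose an F0-rate modulation on each channel envelope so the electrode pulse pattern carries an explicit temporal pitch cue. The raised-sine modulator, the `depth` parameter, and the function name below are assumptions for illustration, not the published algorithm:

```python
import numpy as np

def enhance_periodicity(envelope, f0, fs, depth=0.5):
    """Superimpose an F0-rate modulation on a channel envelope.
    `f0` would come from an F0 track of the clean speech, as in the
    study. The raised-sine modulator swings between (1 - depth) and 1,
    so the enhanced envelope never exceeds the original."""
    t = np.arange(len(envelope)) / fs
    mod = (1.0 - depth) + depth * 0.5 * (1.0 + np.sin(2.0 * np.pi * f0 * t))
    return envelope * mod
```

Because the modulator's maximum is 1, loudness growth is bounded by the original envelope while the F0-rate fluctuation is made explicit.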