The effects of frequency-place shift on consonant confusion in cochlear implant simulations.
ABSTRACT The effects of frequency-place shift on consonant recognition and confusion matrices were examined. Frequency-place shift was manipulated using a noise-excited vocoder with 4 to 16 channels. In the vocoder processing, the location of the most apical carrier band varied from the matched condition (i.e., 28 mm from the base of the cochlea) to a basal shift (i.e., 22 mm from the base) in a step size of 1 mm. Ten normal-hearing subjects participated in the 20-alternative forced-choice test, where the consonants were presented in a /Ca/ context. Shift of 3 mm or more caused the consonant recognition scores to decrease significantly. The effects of spectral resolution disappeared when the amount of shift reached ≥3 mm. Information transmitted for voicing and place of articulation varied with spectral shift and spectral resolution, while information transmitted for manner was affected only by spectral shift but not spectral resolution. Spectral shift has shown specific effects on the confusion patterns of the consonants. The direction of errors reversed as spectral shift increased and the patterns of reversal were consistent across channel conditions. Overall, transmission of the consonant features can be accounted for by the acoustic features of the speech signal.
Ning Zhou, Li Xu,a) and Chao-Yang Lee
School of Hearing, Speech and Language Sciences, Ohio University, Athens, Ohio 45701
(Received 1 April 2009; revised 26 April 2010; accepted 4 May 2010)
© 2010 Acoustical Society of America. [DOI: 10.1121/1.3436558]
PACS number(s): 43.71.Es [KWG]
Pages: 401–409
I. INTRODUCTION
Speech perception has proven to be fairly robust when the speech signal is distorted or
reduced in information (e.g., Van Tasell et al., 1987; ter Keurs et al., 1992, 1993; Baer and
Moore, 1993). It has been shown that good speech understanding can be achieved with greatly
reduced spectral resolution (Shannon et al., 1995). The limited spectral information can be
compensated with increased temporal information in the perception of degraded speech signals
(Xu et al., 2005; Xu and Zheng, 2007; Xu and Pfingst, 2008). In addition, speech recognition
from spectrally distorted signals that involve a number of forms of frequency-place mismatch
has been studied (e.g., Dorman et al., 1997; Shannon et al., 1998; Fu and Shannon, 1999). In
normal hearing, frequency components of an acoustic signal excite particular places in the
cochlea in a tonotopic fashion. In electric hearing with cochlear implants, ideally, speech
signals should be delivered to excite the appropriate places in the cochlea that match the
frequency content of the acoustic signal. However, as a result of shallow insertion of the
electrode array, unequal electrode-to-neuron distances, or compression of frequency maps, a
number of frequency-place mismatch situations may presumably occur in cochlear implant
stimulations (e.g., Dorman et al., 1997; Shannon et al., 1998, 2002; Fu and Shannon, 1999;
Huss and Moore, 2005; Başkent and Shannon, 2003, 2004, 2005, 2006). Shallow insertion of the
electrode array may result in spectral shift. In addition, in clinical mapping, the speech
spectrum is typically compressively assigned to the electrode array, since the electrode array
is too short to cover the entire speech spectrum. Localized fiber loss and current spread,
commonly found in implanted ears, may cause frequency warping.
In acoustic simulations of cochlear implants, the speech
signal is analyzed in a number of spectral channels that are
referred to as the analysis bands. The output from each band
is rectified and lowpass filtered to extract the temporal enve-
lope. The temporal envelope is used to amplitude modulate a
white noise in a noise-excited vocoder or a pure tone in a
tone-excited vocoder. The modulated signal is then bandpass
filtered through the carrier bands. Full insertion of a cochlear
implant is simulated by matching the analysis bands and carrier
bands in frequency. Simulation of shallow insertion is
realized by shifting the carrier bands to higher frequencies
relative to the analysis bands.
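
The envelope-extraction and modulation steps described above can be sketched for a single channel as follows. This is only an illustration under stated assumptions, not the processing used in the study: the band-pass analysis and carrier filters are omitted, a one-pole smoother stands in for a proper low-pass filter design, the 160-Hz default is the envelope cutoff given later in the Methods, and the function names and the test signal are hypothetical.

```python
import math
import random

def extract_envelope(band_signal, fs, cutoff_hz=160.0):
    """Half-wave rectify a band output, then smooth it with a one-pole low-pass."""
    # Smoothing coefficient for the requested cutoff. This one-pole filter is
    # an assumed stand-in for a Butterworth low-pass, not the study's filter.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    envelope, state = [], 0.0
    for sample in band_signal:
        rectified = max(sample, 0.0)           # half-wave rectification
        state += alpha * (rectified - state)   # low-pass smoothing
        envelope.append(state)
    return envelope

def modulate_noise(envelope, seed=0):
    """Amplitude-modulate Gaussian white noise with the temporal envelope."""
    rng = random.Random(seed)
    return [e * rng.gauss(0.0, 1.0) for e in envelope]

fs = 16000
n = int(0.05 * fs)
# Hypothetical band output: a 1-kHz tone whose amplitude ramps up over 50 ms.
band = [(i / n) * math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(n)]
env = extract_envelope(band, fs)
modulated = modulate_noise(env)   # would next be band-pass filtered into a carrier band
```

In a complete vocoder, `modulated` would be band-pass filtered into the matching carrier band (full insertion) or into a higher-frequency carrier band (shallow insertion), and the channels would then be summed.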
Research has shown that the basal spectral shift presumably resulting from shallow insertion
of implants has an immediate detrimental effect on English speech recognition. Dorman et al.
(1997) simulated four shallow insertion depths of a cochlear implant using a five-channel tone
vocoder. They showed that performance of sentence and vowel recognition progressively worsened
as the simulated insertion depth became shallower. Performance at insertion depths of 22 and
23 mm (i.e., 3 to 4 mm basal shift) differed significantly from that of full insertion. In one
of the experiments by Shannon et al. (1998), an 8-mm basal shift was simulated and compared
with a full insertion condition in a 4-channel noise-excited vocoder. Vowel recognition
accuracy was significantly reduced and sentence recognition accuracy dropped almost to zero.
Similar results for vowel recognition were reported by Fu and Shannon (1999): performance
dropped from 80% correct in full insertion to 20% correct after the vowel spectrum was basally
shifted by approximately 7 mm along the cochlea. A significant decrease in vowel recognition
was found when the spectrum was basally shifted by 3 mm or more.

a) Author to whom correspondence should be addressed. Electronic mail: xul@ohio.edu
J. Acoust. Soc. Am. 128 (1), July 2010. 0001-4966/2010/128(1)/401/9/$25.00. © 2010 Acoustical Society of America.
There seems to be a consensus in the literature that vowel and sentence recognition are
highly susceptible to frequency-place shift resulting from shallow insertion (Dorman et al.,
1997; Shannon et al., 1998; Fu and Shannon, 1999). As the carriers of the temporal envelopes
shift, so does the location of the formant frequencies, which serve as a critical acoustic
correlate for vowel recognition. The reduced vowel recognition in turn partially accounts for
the deteriorated sentence recognition. To avoid frequency-place shift, the shallowly inserted
electrodes have to be mapped with tonotopically matched speech content. Faulkner et al. (2003)
have shown that in cases of shallow insertion, tonotopic mapping does not necessarily recover
speech recognition, since a large part of the low-frequency content in the speech signal is
lost. However, the acute effects of spectral shift can be compensated to a certain extent by
training (Rosen et al., 1999; Fu et al., 2002; Fu and Galvin, 2003; Faulkner, 2006). That is,
the human brain has been shown to be able to learn and adapt to frequency-shifted speech
signals.
In contrast, studies that examined the effects of spectral shift on consonant recognition
have yielded somewhat mixed findings. Dorman et al. (1997) reported that consonant recognition
scores underwent smaller decreases compared to vowel and sentence recognition, but the
decrease was still found to be significant with a shift of 2 mm or more. Consistent with the
findings by Dorman et al. (1997), Rosen et al. (1999) reported significantly reduced consonant
recognition performance as a result of basal shift, although the decrease was smaller than
that for vowels and sentences. Rosen et al. (1999) reasoned that consonants can be recognized
by the use of temporal information and gross spectral contrast, which renders consonants more
immune to frequency-place shifts. Shannon et al. (1998), however, found in their study that
while an 8-mm basal shift greatly reduced vowel recognition and caused sentence recognition
accuracy to drop essentially to zero, consonant recognition was nearly intact. It is possible
that the discrepancies between these studies are due to the different frequency allocations
or the use of different speech materials.
Given the discrepancies found in the previous studies, the effects of spectral resolution
(i.e., number of channels) and spectral shift (i.e., frequency-place shift) on consonant
recognition and on consonant features were further examined. We assumed that distortion of the
frequency spectrum and varying spectral resolution could have a differential effect on
information transmission of different articulatory features of consonants. Data were analyzed
in terms of detailed articulatory features within the broad categories of manner, place, and
voicing, which have not been reported in the previous studies. It was hypothesized that
features transmitted predominantly by temporal information would be less affected by spectral
resolution or spectral shift. On the other hand, features that rely more on spectral
information would be increasingly affected as the amount of spectral shift increases or the
spectral resolution becomes poorer. Further, none of the previous studies has reported
consonant error patterns in conditions that varied the degree of spectral resolution and the
amount of spectral shift in combination. The primary question addressed was whether consonant
confusions would vary in systematic patterns with spectral resolution and spectral shift.
II. METHODS
A. Speech materials and signal processing
From the database recorded by Shannon et al. (1999), a set of digitized naturally produced
consonants from one female (#3) and one male (#3) speaker was drawn. The set contained 20
consonants (tʃ, dʒ, t, d, k, g, p, b, n, m, s, z, ʃ, f, v, ð, l, r, j, w) produced in a /Ca/
context, resulting in 40 speech tokens in total (20 consonants × 2 speakers). The speech
tokens were subjected to noise-excited vocoder processing. Signal processing was performed in
MATLAB (MathWorks, Natick, MA). The speech signals were pre-emphasized by highpass filtering
at 1200 Hz (1st-order Butterworth filter, 6 dB/octave) and divided into 4, 8, 12, or 16
frequency bands. The frequency range of the analysis bands was 269–2113 Hz, covering a
tonotopic location between 28 mm and 16 mm from the basal end in a 35-mm-long cochlea
(Greenwood, 1990). The Greenwood (1990) formula, $F = 165.4\,(10^{0.06x} - 1)$, where x is the
distance in mm from the apex, was used to determine the bandwidths and corner frequencies of
the analysis bands. The output of each analysis band was half-wave rectified and then low-pass
filtered at 160 Hz (2nd-order Butterworth, 12 dB/octave) to extract the temporal envelope.
Each temporal envelope was used to amplitude modulate a white noise. The modulated signal was
then bandpass filtered into either the same band where the envelope was extracted to simulate
a tonotopically matched condition, or bandpass filtered into higher frequency bands to
simulate frequency-place shifted conditions. The cut-off frequencies of the carrier bands were
also estimated by the Greenwood (1990) formula. The frequency allocation of the carrier bands
was systematically manipulated to simulate a basal shift from the analysis bands over a
tonotopic distance of 6 mm with a step size of 1 mm. Thus, seven frequency-place matching
conditions (i.e., unshifted and 6 shifted) were created. The manipulation was repeated for all
four channel conditions (i.e., 4, 8, 12, and 16). The frequency allocation for the carrier
bands of 16 channels is provided in Table I. Finally, the outputs of all bands were summed
and stored on computer for acoustic presentation.
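
As a rough check on the frequency allocation, the Greenwood (1990) formula quoted above can be evaluated directly. The sketch below assumes that the band edges are equally spaced in cochlear distance, which the text does not state explicitly; the function names are hypothetical.

```python
def greenwood_hz(x_from_apex_mm):
    """Greenwood (1990) map as quoted in the text: F = 165.4 * (10**(0.06 * x) - 1)."""
    return 165.4 * (10.0 ** (0.06 * x_from_apex_mm) - 1.0)

def carrier_corner_frequencies(n_bands, shift_mm=0.0,
                               apical_mm_from_base=28.0,
                               basal_mm_from_base=16.0,
                               cochlea_length_mm=35.0):
    """Return n_bands + 1 corner frequencies in Hz.

    Assumption: band edges are equally spaced in mm along the cochlea.
    """
    # Convert distances from the base into distances from the apex, then add
    # the basal shift (a basal shift moves every edge away from the apex).
    start = cochlea_length_mm - apical_mm_from_base + shift_mm
    stop = cochlea_length_mm - basal_mm_from_base + shift_mm
    step = (stop - start) / n_bands
    return [greenwood_hz(start + k * step) for k in range(n_bands + 1)]

for shift in range(7):
    edges = carrier_corner_frequencies(16, shift)
    print(shift, [round(f) for f in edges])
```

Under these assumptions the computed edges land close to the corner frequencies listed in Table I, for example about 270 Hz against the tabulated 269 Hz for the most apical edge in the unshifted condition.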
B. Subjects and test procedure
Ten native English-speaking subjects recruited from the Ohio University student population
participated in the study. The subjects were screened for normal hearing (≤20 dB HL) at octave
frequencies between 250 and 8000 Hz. The use of human subjects was reviewed and approved by
the Ohio University Institutional Review Board.
The consonant recognition test was conducted in an IAC sound booth. A graphical user
interface was developed in MATLAB to present stimuli and collect responses from the subjects.
The consonant stimuli were presented to the left ear of the listeners via circumaural
headphones (Sennheiser, HD 265) at a most comfortable level. The subjects adjusted the
soundcard output levels to their respective most comfortable levels before each session of
training or test. The presentation level was approximately 65±5 dB SPL. The task of the
subjects was to identify the consonant they had heard by clicking one of the 20 buttons
labeled with the CV strings (e.g., “Ba,” “Da,” “Ga,” etc.). The subjects were trained with the
tonotopically matched stimuli processed with 4, 8, 12, and 16 channels. Each training session
lasted about 30 min and always started with presumably the easiest stimuli (i.e., the
16-channel condition) followed by progressively more difficult stimuli. During training,
within each channel condition, the presentation of stimuli was randomized. After the subject
gave a response, the button of the correct consonant flashed to provide feedback. Given the
reduced bandwidth of the stimuli, and based on experience from our previous studies and the
literature, training was considered adequate when recognition of the tonotopically matched
stimuli reached 60% correct. On average, each subject took approximately 10 h in training
before continuing with the test. In the test, the presentation of the 5600 stimuli (i.e.,
20 consonants × 2 speakers × 4 channel conditions × 7 frequency-place shift conditions
× 5 repetitions) was completely randomized. The test was divided into a number of sessions
and took each subject approximately 5–6 h to complete.
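
The size of the test set follows from crossing the five factors listed above. A minimal sketch of building and randomizing such a trial list is given below; the integer consonant indices, the speaker labels, and the fixed seed are illustrative assumptions and do not describe the authors' MATLAB interface.

```python
import itertools
import random

consonants = range(20)           # indices standing in for the 20 /Ca/ tokens
speakers = ("female", "male")
channels = (4, 8, 12, 16)
shifts_mm = range(7)             # 0 (matched) to 6 mm basal shift
repetitions = range(5)

# Full crossing of the design: 20 x 2 x 4 x 7 x 5 trials.
trials = list(itertools.product(consonants, speakers, channels, shifts_mm, repetitions))
random.Random(0).shuffle(trials)  # fixed seed only so that the sketch is repeatable
print(len(trials))
```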
C. Consonant feature coding
The 20 consonants were coded according to their articulatory features, which included
voicing (i.e., voiced and voiceless), place (i.e., labial, alveolar, palatal, and velar), and
manner of articulation (i.e., stop, fricative, affricate, nasal, and glide) (Ladefoged, 1975).
The consonants were coded as shown in Table II.
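
For analysis, the coding in Table II can be held in a simple lookup table. The sketch below transcribes the codes from Table II. The first two consonant symbols are illegible in the source text, so "tʃ" and "dʒ" are assumed here on the grounds that Table II codes them as the voiceless and voiced affricates. The helper name `group_by` is hypothetical.

```python
# (voicing, manner, place) codes transcribed from Table II.
# Assumption: the two illegible symbols are the affricates "tʃ" and "dʒ".
CODING = {
    "tʃ": (0, 3, 3), "dʒ": (1, 3, 3), "t": (0, 1, 2), "d": (1, 1, 2),
    "k": (0, 1, 4), "g": (1, 1, 4), "p": (0, 1, 1), "b": (1, 1, 1),
    "n": (1, 4, 2), "m": (1, 4, 1), "s": (0, 2, 2), "z": (1, 2, 2),
    "ʃ": (0, 2, 3), "f": (0, 2, 1), "v": (1, 2, 1), "ð": (1, 2, 2),
    "l": (1, 5, 2), "r": (1, 5, 3), "j": (1, 5, 3), "w": (1, 5, 4),
}
MANNER = {1: "stop", 2: "fricative", 3: "affricate", 4: "nasal", 5: "glide"}
PLACE = {1: "labial", 2: "alveolar", 3: "palatal", 4: "velar"}

def group_by(feature_index, names=None):
    """Group consonants by one feature (0 = voicing, 1 = manner, 2 = place)."""
    groups = {}
    for consonant, codes in CODING.items():
        key = codes[feature_index]
        label = names[key] if names else key
        groups.setdefault(label, []).append(consonant)
    return groups

print(group_by(1, MANNER))
print(group_by(2, PLACE))
```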
III. RESULTS
A. Effects of spectral shift and spectral resolution
Figure 1(a) summarizes the percent correct scores of various channel and spectral shift
conditions. A two-way repeated-measures ANOVA indicated significant effects of both number of
channels [F(3,27)=22.84, p<0.00001] and spectral shift [F(6,54)=144.06, p<0.00001]. Post-hoc
analysis of the main factors revealed that spectral shift of ≥3 mm caused the performance to
significantly decrease from the tonotopically matched condition (i.e., 28 mm) (p<0.05).
Performance with 4 channels was significantly lower than all other channel conditions
(p<0.05), while the performance of 8, 12, and 16 channels did not show significant differences
from each other (p>0.05). ANOVA showed that the interaction between the two factors was also
statistically significant [F(18,162)=13.39, p<0.00001]. Post-hoc analysis of the interaction
revealed that when the spectral shift was ≥3 mm, the spectral resolution of the signals no
longer seemed to play a role and therefore the performance of all channel conditions did not
differ (p>0.05).
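
The degrees of freedom in the F statistics above follow from the design (10 subjects, 4 channel conditions, 7 shift conditions, all within-subject). The short check below uses the standard formulas for a fully within-subject design; the function name is hypothetical.

```python
def within_subject_df(effect_df, n_subjects):
    """Return (effect df, error df) for a within-subject effect."""
    return effect_df, effect_df * (n_subjects - 1)

n_subjects, n_channels, n_shifts = 10, 4, 7

channels_df = within_subject_df(n_channels - 1, n_subjects)
shift_df = within_subject_df(n_shifts - 1, n_subjects)
interaction_df = within_subject_df((n_channels - 1) * (n_shifts - 1), n_subjects)

print(channels_df, shift_df, interaction_df)
```

These reproduce the reported F(3,27), F(6,54), and F(18,162).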
B. Information transmission analysis
Information transmission scores (%) are shown in Figs. 1(b)–1(d) for the features of
voicing, manner, and place of articulation in all experimental conditions (Miller and Nicely,
1955). Three sets of two-way repeated-measures ANOVA were conducted to examine the effects of
number of channels as well as the effects of frequency-place shift for all three features. For
the feature of voicing, the effects of number of channels [F(3,27)=5.38, p=0.005] and spectral
shift [F(6,54)=4.64, p=0.0007] were both found to be statistically significant. For the
feature of manner of articulation, only the effects of spectral shift [F(6,54)=4.25, p=0.001],
but not number of channels [F(3,27)=1.31, p=0.3], were found to be significant. For the
feature of place of articulation, again, both main factors were found to be significant
[F(3,27)=17.83, p<0.00001; F(6,54)=9.08, p<0.00001]. For all three features, the interactions
between the two main effects were not statistically significant [voicing: F(18,162)=1.03,
p=0.43; manner: F(18,162)=1.14, p=0.32; place: F(18,162)=1.22, p=0.25].

TABLE I. Corner frequencies of the carrier bands for seven frequency-place shift conditions in a 16-band processor. The frequency allocations of the eight- and four-band processors can be derived from this table by combining adjacent two and four bands, respectively. Each row lists the 17 corner frequencies (Hz) that bound carrier bands 1 to 16; band n lies between the nth and (n+1)th values.

Shift (mm)  Corner frequencies (Hz)
0            269  316  368  427  492  563  643  731  829  938 1058 1192 1340 1504 1686 1884 2113
1            333  388  448  515  589  671  763  864  976 1101 1239 1393 1563 1751 1961 2193 2450
2            407  470  539  616  701  800  900 1017 1146 1289 1447 1624 1819 2035 2276 2542 2838
3            492  564  643  731  829  938 1058 1192 1340 1504 1686 1888 2113 2361 2637 2843 3282
4            589  672  763  864  977 1101 1239 1393 1563 1751 1961 2173 2450 2736 3052 3404 3793
5            701  800  900 1017 1146 1289 1447 1624 1819 2035 2276 2542 2838 3165 3529 3932 4380
6            829  938 1058 1198 1340 1504 1686 1888 2113 2361 2537 2943 3282 3659 4076 4392 5053

TABLE II. Coding of consonants. Manner of articulation is coded from 1–5 as: stop, fricative, affricate, nasal, and glide. Place of articulation is coded from 1–4 as: labial, alveolar, palatal, and velar. Voicing is coded 1 vs. 0 for voiced vs. voiceless consonants.

Consonant   tʃ  dʒ  t   d   k   g   p   b   n   m   s   z   ʃ   f   v   ð   l   r   j   w
Voicing     0   1   0   1   0   1   0   1   1   1   0   1   0   0   1   1   1   1   1   1
Manner      3   3   1   1   1   1   1   1   4   4   2   2   2   2   2   2   5   5   5   5
Place       3   3   2   2   4   4   1   1   2   1   2   2   3   1   1   2   2   3   3   4
Figure 2 shows the information transmitted as a function of frequency-place shift for each
specific manner of articulation (i.e., stop, fricative, affricate, nasal, and glide) and each
specific place of articulation (i.e., labial, alveolar, palatal, and velar). Each specific
manner and place of articulation category (e.g., stop) was treated as a binary feature (e.g.,
stops vs. non-stops) in the derivation of its information transmission score. Voicing is
already a binary feature and has been described in Fig. 1. The effect of spectral resolution
is not shown in Fig. 2, since it was not a significant factor for transmitting the feature of
manner, and there was no interaction between channels and spectral shift for the feature of
place of articulation. Figure 2 indicates that the non-monotonic function of place of
articulation overall [Fig. 1(d)] is accounted for by the labial feature. The reason that the
labial feature improved after 3 mm shift is elaborated in Discussion Section C.
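
The feature scores above are relative information transmission in the sense of Miller and Nicely (1955): the mutual information between stimulus and response feature categories, divided by the entropy of the stimulus categories. The sketch below shows one way to compute it from confusion counts. The function name and the four-consonant counts are hypothetical illustrations, not data from this study.

```python
import math

def relative_information_transmitted(confusions, feature_of):
    """Percent information transmitted for one feature.

    confusions: dict mapping (stimulus, response) to a count.
    feature_of: dict mapping each consonant to its feature category.
    """
    # Collapse consonant confusions into a category-by-category count table.
    table = {}
    for (stim, resp), count in confusions.items():
        key = (feature_of[stim], feature_of[resp])
        table[key] = table.get(key, 0) + count
    total = sum(table.values())
    p_in, p_out = {}, {}
    for (i, j), count in table.items():
        p_in[i] = p_in.get(i, 0.0) + count / total
        p_out[j] = p_out.get(j, 0.0) + count / total
    # Mutual information between stimulus and response categories (bits).
    transmitted = sum((c / total) * math.log2((c / total) / (p_in[i] * p_out[j]))
                      for (i, j), c in table.items() if c > 0)
    # Entropy of the stimulus categories (bits); assumes at least two categories.
    input_entropy = -sum(p * math.log2(p) for p in p_in.values() if p > 0)
    return 100.0 * transmitted / input_entropy

# Hypothetical counts: voicing is always preserved, place is sometimes confused.
toy = {("t", "t"): 8, ("t", "p"): 2, ("d", "d"): 7, ("d", "b"): 3,
       ("p", "p"): 10, ("b", "b"): 10}
voicing = {"t": 0, "d": 1, "p": 0, "b": 1}
labial = {"t": False, "d": False, "p": True, "b": True}   # binary: labial vs. non-labial

voicing_score = relative_information_transmitted(toy, voicing)
labial_score = relative_information_transmitted(toy, labial)
```

Treating a category such as labial as a binary feature, as described for Fig. 2, only changes the `feature_of` mapping; the computation itself is unchanged.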
C. Confusion analysis
Confusion matrices of data pooled across all channel
conditions are shown in Fig. 3 with different panels for dif-
ferent spectral shift conditions. Note that the quantitative
analysis of the error patterns described below was conducted
for each spectral shift and number of channel condition in
order to elucidate the effects of both factors on confusions.
In fact, such analysis revealed that the confusion error patterns
at different spectral resolutions were similar (see below).
In Fig. 3, the confusion matrices are organized based on features of manner and place of
articulation. First, the consonants that share the same manner of articulation were grouped
together. Boundaries between manners of articulation are indicated by the white squares.
Within each group of manner of articulation, the consonants were then sorted following the
order of alveolar, palatal, velar, and labial, if applicable. This place order reflects the
frequency region of the acoustic correlates that are the most relevant to the perception of
place of articulation (Stevens, 1998; Pickett, 1999; Johnson, 2003).
Specifically, one of the primary acoustic correlates of place distinction of stops is the
spectral dominance of the short-term spectrum at consonantal release (Fant, 1973; Stevens and
Blumstein, 1978). Such acoustic cues are usually associated with specific articulatory
actions. For example, alveolar stops are produced with a relatively short front cavity.
Therefore, they have the spectral peak located in a relatively high frequency range. In
contrast, velar stops are produced with a longer front cavity with spectral prominence in a
lower frequency range. The production of labial stops involves virtually no front vocal
cavity, since the constriction is made at the lips. Therefore, their spectra show a diffuse
pattern, with acoustic energy distributed over the low frequency range. For nasals, the
frequencies of anti-formants are associated with place distinctions. Anti-formants are local
spectral energy minima that arise as a result of the oral cavity absorbing acoustic energy at
specific frequency ranges from the nasal resonance system. The frequencies of the
anti-formants are associated with the length of the oral cavity. For example, the first
anti-formant for the labial nasal /m/ is lower than the first anti-formant for the alveolar
/n/ because the former involves a longer oral cavity (Fujimura, 1962). For fricatives, the
association between spectral prominence and place of constriction has also been identified
(Heinz and Stevens, 1961). The spectral dominance is located at the highest frequencies for
alveolars followed by palatals and bilabials. In sum, the acoustic dominance or attenuation
associated with place of articulation is generally located at the highest frequency for
alveolar, followed by palatal, velar, and labial consonants. A spectral analysis of the
original stimuli used in this study confirmed the differential spectral characteristics
associated with place of articulation as suggested by the acoustical theory of speech
production.

FIG. 1. Percent correct scores and information transmission scores for articulatory features (N=10): (a) averaged percent correct scores are plotted as a function of frequency-place shift. Scores of four different channel conditions are plotted in solid lines with different symbols; [(b)–(d)] averaged information transmission scores are plotted as a function of frequency-place shift for the features of voicing, manner of articulation, and place of articulation. Scores of four different channel conditions are plotted in solid lines with different symbols. Error bars represent SDs. [Panels: A, percent correct (%); B, voicing; C, manner; D, place, with information transmitted (%) on the vertical axis. Horizontal axis: basal shift (mm), 0 to 6. Legend: 4, 8, 12, and 16 channels.]

FIG. 2. Information transmission for particular manner and place of articulation (N=10). Left panel: Information transmitted for sub-categories of manner of articulation is plotted as a function of frequency-place shift in different symbols. Right panel: Information transmitted for sub-categories of place of articulation is plotted as a function of frequency-place shift in different symbols. [Vertical axis: information transmitted (%). Horizontal axis: basal shift (mm), 0 to 6. Legends: stop, fricative, affricate, nasal, glide (left); labial, alveolar, palatal, velar (right).]
The consonant recognition data showed that error patterns changed systematically as the
spectral shift increased (Fig. 3). In the tonotopically matched condition, the confusions were
to a large extent confined within each manner of articulation boundary. As the spectral shift
increased, confusions started to cross the boundaries of manner of articula-
FIG. 3. Confusion matrices using data pooled across channel conditions (N=10). [Seven panels, one per condition: unshifted and 1- to 6-mm basal shift; rows are stimuli and columns are responses; color bar: percent, 0 to 70.] The white squares indicate boundaries of manner of articulation. Within each manner of articulation, the consonants were sorted, whenever appropriate, based on place of articulation following the order of alveolar, palatal, velar, and bilabial. The color scale in each cell represents the value in it with reference to the color bar on the right of the bottom panel. The value in the cell in row j and column k is the percent of times stimulus j was recognized as k (j=1:20; k=1:20).
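
The cell definition in the Fig. 3 caption (row j, column k holds the percent of presentations of stimulus j that were identified as k) can be sketched as below. The function name and the three-consonant trial list are hypothetical. Pooling across channel conditions, as in Fig. 3, would amount to concatenating the trial lists of those conditions before counting.

```python
def confusion_percent(trials, labels):
    """Build a row-normalized confusion matrix in percent.

    trials: iterable of (stimulus, response) pairs.
    labels: consonants in the desired row and column order.
    """
    index = {label: k for k, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for stim, resp in trials:
        counts[index[stim]][index[resp]] += 1
    percent = []
    for row in counts:
        presented = sum(row)   # number of times this stimulus was presented
        percent.append([100.0 * c / presented if presented else 0.0 for c in row])
    return percent

# Hypothetical responses for three consonants.
labels = ["t", "d", "k"]
trials = ([("t", "t")] * 3 + [("t", "k")] + [("d", "d")] * 4
          + [("k", "t")] * 2 + [("k", "k")] * 2)
matrix = confusion_percent(trials, labels)
```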