COMPUTING MEL-FREQUENCY CEPSTRAL COEFFICIENTS
ON THE POWER SPECTRUM
Sirko Molau, Michael Pitz, Ralf Schlüter, and Hermann Ney
Lehrstuhl für Informatik VI, Computer Science Department,
RWTH Aachen – University of Technology, 52056 Aachen, Germany
{molau, pitz, schlueter, ney}@informatik.rwth-aachen.de
ABSTRACT
In this paper we present a method to derive Mel-frequency cepstral coefficients directly from the power spectrum of a speech signal. We show that omitting the filterbank in signal analysis does not affect the word error rate. The presented approach simplifies the speech recognizer's front end by merging subsequent signal analysis steps into a single one. It avoids possible interpolation and discretization problems and results in a compact implementation. We show that frequency warping schemes like vocal tract normalization (VTN) can be integrated easily into our concept without additional computational effort. Recognition test results obtained with the RWTH large vocabulary speech recognition system are presented for two different corpora: the German VerbMobil II dev99 corpus, and the English North American Business News 94 20k development corpus.
1. INTRODUCTION
Most of today's automatic speech recognition (ASR) systems are based on some type of Mel-frequency cepstral coefficients (MFCCs), which have proven to be effective and robust under various conditions. This paper describes an alternative concept to derive MFCCs directly from the power spectrum of the speech signal. A number of subsequent steps of the traditional signal analysis are integrated into the cepstrum transformation, which avoids possible discretization and interpolation errors. The new concept yields equally good recognition performance without a filterbank, and thus reduces the number of parameters that need to be optimized.
The remainder of this paper is organized as follows: In the next section we briefly recapitulate the typical signal analysis procedure. Then we discuss in detail implementation issues of the traditional MFCC computation and present our integrated approach. In Section 4 we demonstrate that frequency warping schemes like VTN can be easily integrated as well. Finally, we present recognition test results for the VerbMobil II and the North American Business News corpus, and draw the conclusions of our work.
2. SIGNAL ANALYSIS
Figure 1 shows the signal analysis front end of a typical ASR system. The speech waveform, sampled at 8 or 16 kHz, is first differentiated (pre-emphasis) and cut into a number of overlapping segments (windowing), each 25 ms long and shifted by 10 ms. Each segment is multiplied by a Hamming window, and the Fourier transform (FFT) is computed for each frame. The power spectrum is warped according to the Mel scale in order to adapt the frequency resolution to the properties of the human ear. Then the spectrum is segmented into a number of critical bands by means of a filterbank, which typically consists of overlapping triangular filters. A discrete cosine transform (DCT) applied to the logarithm of the filterbank outputs yields the raw MFCC vector. The highest cepstral coefficients are omitted to smooth the cepstra and to minimize the influence of the pitch, which is irrelevant for the speech recognition process. The mean of each cepstral component is subtracted, and the variance of each component may also be normalized. Finally, the MFCC vector is augmented with time derivatives. Additional transformations like linear discriminant analysis (LDA) may further increase the temporal context and the discriminative power of the acoustic vector. As a result, signal analysis provides an acoustic vector every 10 ms, typically of dimension 25 to 50.

Figure 1: Typical signal analysis front end (block diagram: speech waveform -> pre-emphasis -> windowing -> |FFT|^2 -> VTN warping -> Mel-frequency warping -> filter bank -> logarithm -> DCT -> cepstral mean subtraction -> variance normalization -> derivatives -> LDA -> acoustic vector).
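As a rough illustration, the front-end steps up to the power spectrum can be sketched as follows. This is a minimal sketch, not the RWTH implementation; the pre-emphasis factor of 0.97 and the 512-point FFT are common choices we assume here, not values taken from this paper:

```python
import numpy as np

def power_spectrum_frames(wav, fs=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    """Pre-emphasis, Hamming windowing, and |FFT|^2 for each frame."""
    wav = np.append(wav[0], wav[1:] - alpha * wav[:-1])   # pre-emphasis (differentiation)
    flen = int(fs * frame_ms / 1000)                      # 25 ms -> 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)                     # 10 ms -> 160 samples
    nfft = 1 << (flen - 1).bit_length()                   # next power of two: 512
    win = np.hamming(flen)
    frames = [wav[s:s + flen] * win
              for s in range(0, len(wav) - flen + 1, shift)]
    return np.abs(np.fft.rfft(frames, n=nfft)) ** 2       # power spectrum per frame

spec = power_spectrum_frames(np.random.randn(16000))      # 1 s of noise as a dummy signal
print(spec.shape)                                         # -> (98, 257): frames x bins
```

From here, the traditional front end would apply Mel warping, the filterbank, the logarithm, and the DCT to each row of `spec`.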
3. COMPUTATION OF MFCCS
We now take a closer look at the computation of cepstral coefficients from speech spectra, i.e. the signal analysis steps between FFT and DCT. We discuss problems of different implementations, and finally present a method to compute MFCCs directly on the power spectrum. Both the traditional and the integrated approach suggested here are depicted in Figure 2.
Figure 2: Comparison of the traditional MFCC computation (left: |FFT|^2 -> VTN warping -> Mel-frequency warping -> filterbank -> logarithm -> DCT) with the integrated approach (right: |FFT|^2 -> logarithm -> DCT with integrated VTN and Mel-frequency warping) investigated here.
3.1. Traditional Filterbank Approach

Mel-frequency warping and the filterbank can be implemented easily in the frequency domain (see Figure 3). One method is to transform the power spectrum, i.e. to compute a Mel-warped spectrum by interpolation from the original discrete-frequency power spectrum. The advantage is that the following triangular filters all have the same shape and can be placed uniformly on the Mel-warped spectrum. On the other hand, the discretization may be especially critical due to the large dynamic range of the power spectrum.
Figure 3: Schematic plot of different triangular filterbank implementations (Mel scale: f_mel(f) = 2595 * lg(1 + f / 700 Hz)). The filters are either uniformly distributed on the Mel-warped spectrum, or non-uniformly on the original spectrum. In the latter case, they should be asymmetric as well.
Another way is to place the triangular filters non-uniformly on the unwarped spectrum and thereby implicitly incorporate Mel-frequency scaling [1]. However, discretization errors may then occur if the spectral resolution is not appropriate. The lowest filters could cover only a very few spectral lines, and the maximum of one of the filters may fall just in between two spectral lines. In addition, the filters should no longer be triangular and symmetric, but should bend according to the shape of the Mel function at the position of the filter.

Last but not least, it is not clear how many filters are required and which filter shape is optimal. Triangular filters are occasionally replaced by trapezoidal or more complex shapes derived from auditory models, and we sometimes observed better word error rates when using filters with cosine shape.

In all cases the logarithm of the filterbank output is cosine transformed to obtain MFCCs.
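As an illustration of the second variant, here is a minimal sketch of a triangular filterbank whose edges are equally spaced on the Mel scale, so that the filters fall non-uniformly on the linear frequency axis. The filter count of 20 and the 512-point FFT are our assumptions, not values from this paper:

```python
import numpy as np

def mel(f):
    """f_mel(f) = 2595 * lg(1 + f / 700 Hz)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    """Inverse of the Mel function, back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, nfft=512, fs=16000):
    """Triangular filters: uniform on the Mel axis, non-uniform in Hz."""
    edges = mel_inv(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))  # edges in Hz
    bins = np.fft.rfftfreq(nfft, d=1.0 / fs)                         # bin centers in Hz
    fb = np.zeros((n_filters, len(bins)))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (bins - lo) / (ctr - lo)          # rising slope
        down = (hi - bins) / (hi - ctr)        # falling slope
        fb[i] = np.clip(np.minimum(up, down), 0.0, 1.0)
    return fb

fb = mel_filterbank()
# filterbank outputs for one power-spectrum frame: fb @ frame, then log and DCT
```

Note how the lowest filters span only a handful of FFT bins, which is exactly the discretization concern raised above.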
3.2. Computing MFCCs Directly On The Power Spectrum

We have investigated an alternative method to compute Mel-frequency warped cepstral coefficients directly on the power spectrum and thereby avoid possible problems of the standard approach.
Ignoring any spectral warping for a moment, cepstral coefficients c_k can be derived by Eq. (1):

    c_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} d\omega \, \lg|X(e^{j\omega})| \, e^{j\omega k}    (1)

Depending on whether a filterbank is used or not, |X(e^{j\omega})| stands for either the filterbank outputs or the power spectrum.
The sequential application of a monotone and invertible frequency warping function \tilde{\omega} = g(\omega) and the DCT can be expressed as follows:

    c_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} d\tilde{\omega} \, \lg|X(e^{j g^{-1}(\tilde{\omega})})| \, e^{j\tilde{\omega} k}    (2)
To incorporate warping directly into the cosine transformation, we change the integration variable and use the derivative of the warping function, d\tilde{\omega} = g'(\omega)\,d\omega (Eq. 3). The continuous integral is later approximated in the standard way by a discrete sum (Eq. 4):

    c_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} d\omega \, \lg|X(e^{j\omega})| \, e^{j g(\omega) k} \, g'(\omega)    (3)

        \approx \frac{1}{N} \sum_{n=0}^{N-1} \lg\left|X\left(e^{j\frac{2\pi n}{N}}\right)\right| \cos\left(g\left(\frac{2\pi n}{N}\right) k\right) g'\left(\frac{2\pi n}{N}\right)    (4)
One specific type of frequency warping is the Mel-frequency scaling \tilde{g}_{mel}(\omega), which is usually carried out according to formula (5) with the sampling frequency f_s [6]:

    \tilde{g}_{mel}(\omega) = 2595 \cdot \lg\left(1 + \frac{\omega f_s}{2\pi \cdot 700\,\mathrm{Hz}}\right)    (5)
For integration into the cosine transformation, the Mel warping function needs to be normalized in order to meet the criterion g_{mel}(\pi) = \pi:

    g_{mel}(\omega) = \frac{\pi}{c} \cdot \lg\left(1 + \frac{\omega f_s}{2\pi \cdot 700\,\mathrm{Hz}}\right)    (6)

with

    c = \lg\left(1 + \frac{f_s}{2 \cdot 700\,\mathrm{Hz}}\right).
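As a quick check, substituting \omega = \pi into Eq. (6) confirms the normalization criterion:

    g_{mel}(\pi)
      = \frac{\pi}{c}\,\lg\left(1 + \frac{\pi f_s}{2\pi \cdot 700\,\mathrm{Hz}}\right)
      = \frac{\pi}{c}\,\lg\left(1 + \frac{f_s}{2 \cdot 700\,\mathrm{Hz}}\right)
      = \frac{\pi}{c} \cdot c
      = \pi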
Replacing g(\omega) in Eq. (4) by g_{mel}(\omega) leads to a compact implementation of MFCC computation with only a few lines of code. A lookup table for constants like the derivative and the cosine term can be precomputed; all that remains is a matrix multiplication on the logarithm of the power spectrum. Figure 4 shows
the effect of the modified signal analysis on two cepstrum coefficients for a test sentence from the VerbMobil II corpus. Whereas the lower order coefficients are almost identical, the difference increases with higher coefficient orders due to the discarded filterbank.
Figure 4: Comparison of cepstrum coefficients 1 (upper curve) and 15 (lower curve) over time frames for a test sentence from the VerbMobil II corpus (baseline: traditional filterbank approach; integrated: DCT with integrated Mel-frequency warping).
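To make the compactness concrete, here is a minimal sketch of the integrated computation in the spirit of Eqs. (4) and (6). The discretization details (512 FFT points, 16 cepstral coefficients, and mirroring the warping onto the upper half of the symmetric power spectrum) are our assumptions, not specifics from the paper:

```python
import numpy as np

def g_mel(omega, fs=16000):
    """Normalized Mel warping g_mel(omega), Eq. (6): g_mel(pi) = pi."""
    c = np.log10(1.0 + fs / (2.0 * 700.0))
    return (np.pi / c) * np.log10(1.0 + omega * fs / (2.0 * np.pi * 700.0))

def g_mel_prime(omega, fs=16000):
    """Derivative g'_mel(omega), from the change of integration variable."""
    c = np.log10(1.0 + fs / (2.0 * 700.0))
    k = fs / (2.0 * np.pi * 700.0)
    return (np.pi / c) * k / ((1.0 + k * omega) * np.log(10.0))

def integrated_mfcc_matrix(n_ceps=16, nfft=512, fs=16000):
    """Precomputed matrix T for Eq. (4): c = T @ log|X|^2 over all N FFT bins."""
    omega = 2.0 * np.pi * np.arange(nfft) / nfft        # omega_n = 2*pi*n/N
    # mirror the warping onto the upper half so the symmetric spectrum
    # is treated consistently (our assumption about the discretization)
    om = np.minimum(omega, 2.0 * np.pi - omega)
    k = np.arange(n_ceps)[:, None]
    return np.cos(k * g_mel(om)) * g_mel_prime(om) / nfft

T = integrated_mfcc_matrix()
# per frame: c = T @ np.log(power_spectrum)   (full N-point |FFT|^2)
```

Since `T` is precomputed once, the whole Mel-warped DCT indeed reduces to a single matrix multiplication on the log power spectrum, as stated above.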
4. INTEGRATION OF VTN

Vocal tract length normalization (VTN) is a speaker normalization scheme that also relies on warping the power spectrum. The idea is to compensate for the shift of formants in speech spectra caused by the speaker-specific length of the vocal tract.

It has been shown before that one possible VTN implementation is to modify the location of the filters in the filterbank just as for Mel-frequency scaling [2]. From what we have presented in the previous section it is clear, however, that VTN can also be fully integrated into the cepstrum transformation.
The VTN warping function \tilde{\omega} = g_\alpha(\omega), \omega \in [0, \pi], needs to be monotone and invertible as well. A simple choice is a piecewise linear warping function as shown in Figure 5. The inflexion frequency \omega_0 at which the slope of the function changes depends on \alpha:

    \omega_0 = \frac{7}{8}\,\pi                   for \alpha \le 1
    \omega_0 = \frac{7}{8}\,\frac{\pi}{\alpha}    for \alpha > 1

In order to avoid complicated case distinctions for different warping factors and frequencies, we write the warping function in the following convenient form:

    g_\alpha(\omega) = a \cdot \omega + b    (7)
Figure 5: Warping function for piecewise linear VTN (normalized frequency \nu_\alpha(\omega) plotted against the original frequency \omega for \alpha > 1, \alpha = 1, and \alpha < 1; inflexion at 7\pi/8).
with parameters a and b. Although these parameters formally depend on \omega, they can take on only two values:

    a = \alpha,                                        b = 0                         for \omega \le \omega_0
    a = \frac{\pi - \alpha\omega_0}{\pi - \omega_0},   b = (\alpha - a)\,\omega_0    for \omega > \omega_0
Mel warping is applied after the spectra are scaled according to VTN. Hence, the combination g(\omega) = g_{mel}(g_\alpha(\omega)) of Mel and VTN warping becomes

    g(\omega) = \frac{\pi}{c} \cdot \lg\left(1 + \frac{(a\omega + b)\, f_s}{2\pi \cdot 700\,\mathrm{Hz}}\right)    (8)
with the derivative:

    g'(\omega) = \frac{\pi}{c} \cdot \frac{a\, f_s}{\left(2\pi \cdot 700\,\mathrm{Hz} + (a\omega + b)\, f_s\right) \ln 10}    (9)

Cepstrum coefficients with integrated VTN and Mel-frequency warping are obtained by replacing g(\omega) in Eq. (4) by g_{mel}(g_\alpha(\omega)).
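The piecewise linear warping and its combination with Mel warping can be sketched as follows. This follows Eqs. (7) and (8) with the inflexion frequency 7\pi/8 taken from Figure 5; the function and variable names are our own:

```python
import numpy as np

def vtn_params(alpha):
    """Inflexion frequency w0 and the slope above it, per Eq. (7)."""
    w0 = 7.0 / 8.0 * np.pi if alpha <= 1.0 else 7.0 / 8.0 * np.pi / alpha
    a_hi = (np.pi - alpha * w0) / (np.pi - w0)   # slope above the inflexion
    return w0, a_hi

def g_alpha(omega, alpha):
    """Piecewise linear VTN warping: monotone, with g_alpha(pi) = pi."""
    w0, a_hi = vtn_params(alpha)
    return np.where(omega <= w0,
                    alpha * omega,
                    alpha * w0 + a_hi * (omega - w0))

def g_combined(omega, alpha, fs=16000):
    """Mel warping applied after VTN, Eq. (8)."""
    c = np.log10(1.0 + fs / (2.0 * 700.0))
    return (np.pi / c) * np.log10(
        1.0 + g_alpha(omega, alpha) * fs / (2.0 * np.pi * 700.0))

w = np.linspace(0.0, np.pi, 5)
print(g_alpha(w, 1.1)[-1], g_combined(w, 1.1)[-1])   # both map pi to pi
```

Replacing `g_mel` by `g_combined` (and its derivative accordingly) in the precomputed DCT matrix then gives one warping matrix per speaker-specific \alpha.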
5. RECOGNITION TESTS

To evaluate the proposed signal analysis approach, we performed recognition tests with the RWTH large vocabulary speech recognition system (see [3], [4], and [5] for detailed system descriptions) on two different corpora. The VerbMobil II task (VM II) is German spontaneous speech with a 10k-word vocabulary, and the North American Business News task (NAB) is clean read speech of Wall Street Journal texts with a recognition vocabulary of 20k words. Details of the training and test corpora are given in Table 1.
Table 1: Statistics of the training and test corpora

                    VerbMobil II           Wall Street Journal
    Corpus          Training    Test       Training    Test
                    CD 1-41     DEV99      WSJ0+1      DEV94
    Duration        61.5 h      1.6 h      81.4 h      0.8 h
    Sil. Portion    13%         11%        27%         19%
    # Speakers      857         16         284         20
    # Sent.         36,015      1,081      37,571      310
    # Words         701,512     14,662     649,624     7,378
    Trigram PP      -           62.0       -           126.6
Figure 6: Warping factor distribution of the VM II training speakers (545 male and 581 female speakers; warping factor alpha between 0.88 and 1.12). The upper histogram was obtained with Mel- and VTN-warped spectra computed by linear interpolation, the lower histogram with integrated Mel-frequency and VTN warping.
The first result of using the integrated approach in VTN training was a much smoother distribution of warping factors. Figure 6 shows the corresponding histograms for the VM II training corpus. A closer inspection revealed that the linear interpolation of spectral lines when transforming the power spectrum for VTN warping was the main reason for the erratic distribution observed before. It turned out, however, that the word error rate (WER) was only marginally affected by this difference.

Next, we compared the recognition performance of the traditional signal analysis approach (baseline) with the integrated MFCC computation. Additional tests were carried out with two-pass and fast VTN as described in [4]. The best results of each setup are summarized in Table 2.
Table 2: Recognition test results for the VM II and the NAB corpus applying no VTN, two-pass VTN, and fast VTN (baseline: traditional filterbank approach; integrated: DCT with integrated frequency warping).

    Corp.   VTN      Cepstrum     # Dns [k]   Del [%]   Ins [%]   WER [%]
    VM II   no       Baseline     455         4.9       4.8       25.7
                     Integrated   457         5.0       4.4       25.3
            2-Pass   Baseline     450         4.4       4.3       23.8
                     Integrated   451         4.9       4.1       24.0
            Fast     Baseline     450         4.5       4.5       23.8
                     Integrated   451         5.0       4.1       24.0
    NAB     no       Baseline     596         1.5       2.3       12.5
                     Integrated   599         1.5       2.3       12.4
            2-Pass   Baseline     563         1.4       2.4       11.8
                     Integrated   591         1.4       2.2       11.7
            Fast     Baseline     563         1.4       2.3       11.9
                     Integrated   591         1.5       2.2       11.8
We found that the recognition performance of both methods is similar. In most cases the integrated approach performed almost as well as or slightly better than the traditional sequential analysis with a filterbank. A similar behaviour was found on smaller German and Italian telephone speech corpora (VerbMobil and EuTrans).
6. CONCLUSIONS

In this paper we have presented an alternative signal analysis approach that merges a number of subsequent analysis steps into one. Omitting the filterbank and integrating Mel-frequency warping into the cepstrum transformation simplifies the signal analysis (no filterbank parameters need to be optimized), avoids possible interpolation and discretization problems, and leads to a compact implementation of the MFCC front end. We have shown that concepts like VTN that rely on warping speech spectra can be easily integrated as well. Recognition tests on the VerbMobil II and the North American Business News corpus revealed that the new approach performs as well as the traditional signal analysis.
7. REFERENCES

[1] S. B. Davis and P. Mermelstein: "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, August 1980, pp. 357-366.

[2] L. Lee and R. Rose: "Speaker normalization using efficient frequency warping procedures", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Atlanta, GA, May 1996, pp. 353-356.

[3] H. Ney, L. Welling, S. Ortmanns, K. Beulen, and F. Wessel: "The RWTH Large Vocabulary Continuous Speech Recognition System", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Seattle, WA, May 1998, pp. 853-856.

[4] A. Sixtus, S. Molau, S. Kanthak, R. Schlüter, and H. Ney: "Recent Improvements of the RWTH Large Vocabulary Speech Recognition System on Spontaneous Speech", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 2000, pp. 1671-1674.

[5] L. Welling, N. Haberland, and H. Ney: "Acoustic Front-End Optimization for Large Vocabulary Speech Recognition", Proc. EUROSPEECH, Rhodes, Greece, September 1997, pp. 2099-2102.

[6] S. J. Young: "HTK: Hidden Markov Model Toolkit V1.4", User Manual, Cambridge University Engineering Department, Cambridge, England, February 1993.