EXTRACTING DEEP BOTTLENECK FEATURES USING STACKED AUTO-ENCODERS
1Interactive Systems Lab, Karlsruhe Institute of Technology; Germany
2Language Technologies Institute, Carnegie Mellon University; Pittsburgh, PA; USA
In this work, a novel training scheme for generating bottleneck fea-
tures from deep neural networks is proposed. A stack of denois-
ing auto-encoders is first trained in a layer-wise, unsupervised man-
ner. Afterwards, the bottleneck layer and an additional layer are
added and the whole network is fine-tuned to predict target phoneme
states. We perform experiments on a Cantonese conversational tele-
phone speech corpus and find that increasing the number of auto-
encoders in the network produces more useful features, but requires
pre-training, especially when little training data is available. Using additional unlabeled data for pre-training alone yields further gains.
Evaluations on larger datasets and on different system setups demon-
strate the general applicability of our approach. In terms of word
error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.
Index Terms— Bottleneck features, Deep learning, Auto-encoders

1. INTRODUCTION
The two main approaches for incorporating artificial neural networks
(ANNs) in acoustic modeling today are hybrid systems and tandem
systems. In the former, a neural network is trained to estimate the
emission probabilities for Hidden Markov Models (HMMs). In
contrast, tandem systems use neural networks to generate discrimi-
native features as input values for the common combination of Gaus-
sian Mixture Models (GMM) and HMMs. This is done by training
a network to predict phonetic targets, and then either using the es-
timated target probabilities (“probabilistic features”) or the activations of a narrow hidden layer (“bottleneck features”, BNF).
Those features are usually fused with standard input features and
decorrelated for modeling with GMMs.
In the field of machine learning, deep learning deals with the effective training of deep neural networks. Most approaches are generally inspired by the greedy, unsupervised, layer-wise pre-training scheme first proposed by Hinton et al. While Hinton et al.
used a deep belief network (DBN) where each layer is modeled by
a restricted Boltzmann machine (RBM), later works showed that
other architectures like auto-encoders or convolutional neural
networks are suitable for building deep networks using similar
schemes as well.
Neural networks have been successfully used for acoustic mod-
eling over two decades ago, but had been mostly abandoned
in favor of GMM/HMM acoustic models throughout the late 1990s.
However, the ability to train networks with millions of parameters
in a feasible amount of time has caused a renewed interest in con-
nectionist models. Recently, deep neural networks have been
used with great success in hybrid DNN/HMM systems, resulting in
strong improvements on challenging large-vocabulary tasks such as
Switchboard or Bing voice search data.
2. RELATED WORK
While bottleneck features have been used in speech recognition sys-
tems for some time now, only few works on applying deep learning
techniques to this task have been published.
In 2011, Yu & Seltzer applied a deep belief network as proposed
by Hinton et al. for extracting bottleneck features, with the bottle-
neck being a small RBM placed in the middle of the network.
The network was pre-trained on frames of MFCCs including deltas
and delta-deltas, and then fine-tuned to predict either phoneme or
senone targets. They found that pre-training the RBMs increases the
accuracy of the recognition system, and that additional strong im-
provements can be achieved by using context-dependent targets for
supervised training. However, they noted that possibly due to their
symmetric placement of the bottleneck layer, increasing the number
of layers in the network to more than 5 did not improve recogni-
tion performance any further. In a more recent work, it was also
argued that RBMs are not suitable for modeling decorrelated data
like MFCCs.
Sainath et al. introduced DBN training in a previously proposed architecture based on training an auto-encoder on phonetic class probabilities estimated by a neural network. In their work, they first trained a stack of RBMs for classification of
speaker-adapted PLP features and applied a 2-step auto-encoder to
reduce the output of the resulting DBN to 40 bottleneck features.
These features out-performed a strong GMM/HMM system using
the same input, but they found that performance gains are higher
when training systems on little data.
This work proposes a different approach that profits from in-
creasing the model capacity by adding more hidden layers, and en-
ables the supervised training of the bottleneck layer in order to re-
trieve useful features for a GMM/HMM acoustic model. Instead of
pre-training the layers with restricted Boltzmann machines, we use
auto-encoders, which are straightforward to set up and train.
3. MODEL DESCRIPTION
The architecture proposed for bottleneck feature extraction is illus-
trated in Figure 1. A deep neural network consisting of a stack
of auto-encoders is first pre-trained on frames of speech data in a
layer-wise, unsupervised manner. This process follows the standard scheme for pre-training a network that might be used for a classification task later. Afterwards, the bottleneck layer followed by a hidden and a classification layer are added to the network. The whole network is then fine-tuned in order to predict the phonetic targets attached to the input frames. Since there are potentially many hidden layers between the input data and the bottleneck layer, we call features extracted this way “deep bottleneck features” (DBNF).

Fig. 1. Proposed architecture

978-1-4799-0356-6/13/$31.00 ©2013 IEEE. ICASSP 2013
For pre-training the stack of auto-encoders, we used denoising
auto-encoders as proposed for learning deep networks by Vincent et
al. This model works like a standard auto-encoder (or auto-
associator) network, which is trained with the objective to learn a
hidden representation that allows it to reconstruct its input. The dif-
ference is that in order to force even very large hidden layers to ex-
tract useful features, the network is forced to reconstruct the original
input from a corrupted version, generated by adding random noise
to the data. This corruption of the input data can be formalized as
applying a stochastic process q_D to an input vector x:

    x̃ ∼ q_D(x̃ | x)

Here, q_D applies masking noise to the data by setting a random fraction of the elements of x
to zero. Using the weight matrix W of the hidden layer, the bias
vector b of the hidden units and a non-linear activation function σ_y, the hidden representation y, or the encoding, is then computed as

    y = σ_y(W x̃ + b)
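As a concrete illustration, the corruption and encoding steps can be sketched in a few lines of numpy. The 330-dimensional input and 1000 hidden units mirror the lMEL setup described later; the sigmoid activation and the uniform initialization range are assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, fraction=0.2):
    """q_D: masking noise, zeroing a random fraction of the elements of x."""
    return x * (rng.random(x.shape) >= fraction)

def encode(x_tilde, W, b):
    """Hidden representation y = sigma_y(W x_tilde + b), here with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(W @ x_tilde + b)))

x = rng.standard_normal(330)                 # one stacked input frame
W = rng.uniform(-0.05, 0.05, (1000, 330))    # small random initialization
b = np.zeros(1000)
y = encode(corrupt(x), W, b)                 # 1000-dimensional encoding
```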
In a model using tied weights, the same weight matrix is used for both encoding and decoding. The reconstruction z is thus obtained using the transposed weight matrix and the visible bias vector c, again followed by a non-linear function σ_z:
    z = σ_z(Wᵀ y + c)

The resulting auto-encoder is then trained to reconstruct its original input x from the randomly corrupted version, which is done using standard back-propagation to minimize a corresponding error term, or loss function L(x, z). For modeling real-valued speech data with the first denoising auto-encoder, we use the mean squared error.
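One such training step can be sketched in plain numpy with tied weights, masking noise, and the mean squared error; the manual gradients follow from the equations above, but the dimensions, learning rate, and per-sample updates are illustrative (the paper's actual Theano implementation is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, W, b, c, lr=0.1, corruption=0.2):
    """One SGD step of a tied-weight denoising auto-encoder (MSE loss)."""
    x_t = x * (rng.random(x.shape) >= corruption)  # corrupted input
    y = sigmoid(W @ x_t + b)                       # encode
    z = sigmoid(W.T @ y + c)                       # decode with tied weights
    # Back-propagation of L(x, z) = mean((x - z)^2):
    dz = 2.0 * (z - x) / x.size * z * (1 - z)      # grad at decoder pre-activation
    dy = (W @ dz) * y * (1 - y)                    # grad at encoder pre-activation
    dW = np.outer(dy, x_t) + np.outer(y, dz)       # both tied-weight contributions
    W -= lr * dW
    b -= lr * dy
    c -= lr * dz
    return np.mean((x - z) ** 2)

x = rng.standard_normal(20)
W = rng.uniform(-0.05, 0.05, (50, 20))
b, c = np.zeros(50), np.zeros(20)
losses = [dae_step(x, W, b, c) for _ in range(300)]
```

Over repeated steps the reconstruction error on this frame decreases; a real implementation would of course average gradients over mini-batches of many frames.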
After the first auto-encoder has been trained, another one is
trained to encode and reconstruct the hidden representation of the
first one in a similar fashion. This time, the first auto-encoder com-
putes its encoding from the uncorrupted version of the input vector,
and corruption is applied to the input of the model being currently
trained only. Since the second auto-encoder now models the prob-
abilities of hidden units in the first auto-encoder being active, we
train it using the cross-entropy loss:
    L(x, z) = − Σ_i [ x_i log z_i + (1 − x_i) log(1 − z_i) ]
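In code, this reconstruction cross-entropy can be sketched as follows (the clipping constant is an assumption added to guard against log(0); note that for soft targets the loss is minimized, but not zero, at z = x):

```python
import numpy as np

def cross_entropy(x, z, eps=1e-12):
    """L(x, z) = -sum_i [x_i log z_i + (1 - x_i) log(1 - z_i)], x and z in [0, 1]."""
    z = np.clip(z, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

x = np.array([0.1, 0.9, 0.5])
loss_same = cross_entropy(x, x)                      # minimum for these targets
loss_off = cross_entropy(x, np.array([0.9, 0.1, 0.5]))  # larger for a bad reconstruction
```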
After training a stack of auto-encoders in this manner, a feed-
forward neural network is constructed by connecting a small bot-
tleneck layer to the top auto-encoder, followed by another hidden
layer and the classification layer. These layers are not pre-trained, but initialized with random weights sampled uniformly from a small range, as was also done for the auto-encoder weights. The classification layer employs a softmax
activation function for estimating class probabilities, and the whole
network is trained using standard backpropagation.
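Putting the pieces together, the resulting feed-forward network might look as follows; only the forward pass is shown, and the layer sizes and number of target classes (140 here) are placeholders, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

def init_layer(n_in, n_out, scale=0.05):
    """Random weights sampled uniformly from a small range."""
    return rng.uniform(-scale, scale, (n_out, n_in)), np.zeros(n_out)

# 330 lMEL inputs -> three pre-trained 1000-unit auto-encoder layers
# -> 42-unit bottleneck -> 1000-unit hidden layer -> softmax over targets.
sizes = [330, 1000, 1000, 1000, 42, 1000, 140]
layers = [init_layer(n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

def forward(x):
    for W, b in layers[:-1]:
        x = sigmoid(W @ x + b)
    W, b = layers[-1]
    return softmax(W @ x + b)        # estimated class probabilities

p = forward(rng.standard_normal(330))
```

After fine-tuning, only the activations of the 42-unit bottleneck layer are read out as features; the layers above it serve no purpose at extraction time.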
4. EXPERIMENTAL SETUP
4.1. Baseline Systems
Baseline system setup and evaluation of trained networks were done
using the JANUS recognition toolkit and IBIS decoder in a similar configuration as described in our previous works on bottleneck features. For the baseline, samples consisting of 13
MFCCs were extracted and stacked with 11 adjacent samples, re-
sulting in a total of 143 coefficients. LDA was applied to compute
the final 42-dimensional feature vectors for the recognition system.
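The frame-stacking step can be sketched as follows; the edge padding at utterance boundaries is an assumption, since the paper does not specify how boundaries are handled:

```python
import numpy as np

def stack_frames(feats, context=5):
    """Stack each frame with its +/-context neighbors (11 frames in total)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

mfcc = np.random.randn(100, 13)   # 100 frames of 13 MFCCs
stacked = stack_frames(mfcc)      # 13 * 11 = 143 coefficients per frame
```

An LDA transform would then reduce these 143 coefficients to the final 42 dimensions.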
Acoustic model training was performed differently for each system
used for evaluation; please refer to section 4.4 for individual descrip-
tions and resulting baseline performances.
4.2. Network Training
In our experiments, we extracted deep bottleneck features from
both MFCCs and log mel scale filterbank coefficients (lMEL). The
network input for MFCCs consisted of the same 143 coefficients
as for the baseline system, sampled from the data with a context-
independent model. When using lMEL features, 30 coefficients
were extracted from the spectrogram and stacked in the same way,
thus forming a 330-dimensional feature vector. During supervised
fine-tuning, the neural network was trained to predict monophone states.
For pre-training the stack of auto-encoders in the proposed ar-
chitecture, mini-batch gradient descent with a batch size of 64 and
a learning rate of 0.01 was used. Input vectors were corrupted by
applying masking noise to set a random 20% of their elements to
zero. Each auto-encoder contained 1000 hidden units and received
4 million updates before its weights were fixed and the next one was
trained on top of it. Limiting the training time by model updates
rather than by epochs was done in order to be able to compare the
influence of using different datasets for pre-training.
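The schedule above, limiting each layer's training by a fixed number of mini-batch updates rather than by epochs, can be sketched as follows (`make_batches` and `train_step` are placeholders for the data pipeline and the auto-encoder update):

```python
def pretrain_layer(make_batches, train_step, max_updates):
    """Apply train_step until a fixed number of mini-batch updates is reached,
    restarting from the first batch whenever the dataset is exhausted."""
    updates = 0
    while updates < max_updates:
        for batch in make_batches():   # fresh pass over the data (one epoch)
            train_step(batch)
            updates += 1
            if updates == max_updates:
                return updates
    return updates

# Toy usage: 3 batches per epoch, 10 updates -> 3 full epochs plus 1 batch.
seen = []
n = pretrain_layer(lambda: iter([1, 2, 3]), seen.append, 10)
```

This makes networks pre-trained on differently sized datasets directly comparable, since each receives the same number of updates.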
The remaining layers were then added to the network, with the
bottleneck consisting of 42 units. Again, gradients were computed
by averaging across a mini-batch of training examples; for fine-
tuning, we used a larger batch size of 256. Supervised training was
done for 50 epochs with a learning rate of 0.05. After each epoch,
the current model was evaluated on a separate validation set, and
the model performing best on this set was used in the speech recog-
nition system afterwards. The training of auto-encoder layers and
neural networks was done on GPUs using the Theano toolkit.
4.3. Bottleneck Features
The 42 output values of the bottleneck layer of 11 adjacent frames
were stacked and reduced to a 42-dimensional feature vector us-
ing LDA. Using this feature vector, a context-dependent system was
trained starting from a context-independent MFCC baseline system
as described in section 4.1.
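The LDA reduction used here (and for the baseline features) can be sketched in plain numpy. The Fisher-criterion formulation below, the regularization term, and the random data are illustrative assumptions, not the toolkit's exact implementation:

```python
import numpy as np

def lda(X, labels, n_components):
    """Project X onto directions maximizing between-class over within-class
    scatter (generalized eigenproblem Sb v = lambda Sw v)."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))                       # within-class scatter
    Sb = np.zeros((d, d))                       # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += 1e-6 * np.eye(d)                      # regularize for stability
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:n_components]]

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 462))            # 11 stacked 42-dim frames
labels = rng.integers(0, 61, 2000)              # hypothetical phone labels
reduced = X @ lda(X, labels, 42)                # 42-dimensional features
```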
4.4. Corpora Description and Baseline Performance
Initial experiments for selecting optimal network layouts and hyper-
parameters were performed on a development system, which was
trained on 44 hours of Cantonese conversational telephone speech.
This corpus was recently released as the IARPA Babel Program 
Cantonese language collection babel101-v0.1d. Other releases from
this program used in this work are the Cantonese language collection babel101-v0.4c and the Tagalog language collection. The latter comes in two versions: babel106-
v0.2f consisting of 69h of conversational telephone speech, and the
subset babel106b-v0.2g-sub-train with 9h of speech.
All baseline systems used MFCC input data and were among the initial builds on these recent and comparatively small corpora. Both Cantonese
systems used GMM/HMM acoustic models that were ML-trained
only. For Cantonese babel101-v0.1d, the baseline achieved a char-
acter error rate (CER)¹ of 71.5% by using a 3-gram language model
extracted from transcriptions included in the babel101-v0.1d data.
The baseline system for the full dataset babel101-v0.4c employed a language model generated from the corpus transcriptions and text from Wikipedia articles, which resulted in 66.4% CER.
In contrast to the setups for Cantonese, the Tagalog babel106-
v0.2f and babel106b-v0.2g-sub-train systems used Boosted Max-
imum Mutual Information (BMMI) discriminative training and
speaker adaptive training (SAT) for their acoustic models. A similar
language model as for Cantonese babel101-v0.4c was used, gener-
ated from the corpus transcriptions and text from Tagalog Wikipedia
articles. This resulted in a WER of 64.3% for the full version trained
on 69 hours and 72.1% WER for the babel106b-v0.2g-sub-train system.
For evaluating the proposed setup on a larger task in a lan-
guage more frequently used in speech recognition development,
the Switchboard dataset was used. Similar to the Cantonese
systems, acoustic models were set up using ML training, this time
on 300 hours of speech. For decoding, a 3-gram language model
was generated from the transcriptions included in the dataset, which
resulted in a WER of 39.0%.
5. RESULTS
In first experiments on Cantonese babel101-v0.1d, we compared
different network architectures and inputs. For the input data, we
found that extracting deep bottleneck features from lMEL instead of
MFCC data resulted in consistently better recognition performance
of about 2% CER absolute. This is consistent with the findings for
deep belief networks in earlier work, and although we did not perform additional experiments to confirm that the larger dimensionality of the
lMEL input vector is not the sole reason for the performance gap,
we decided to use lMEL features for further experiments. Using
more than 1000 hidden units for each of the auto-encoder layers did
not result in improved recognition rates.
¹ Since Cantonese lacks a clear definition of a word, character error rate was used instead of the usual word error rate for evaluating the respective systems.
Table 1. Performance of the Cantonese babel101-v0.1d system (in % CER) on features generated from different networks: networks trained on 44h, networks trained on 80h, and networks pre-trained on 80h with fine-tuning on 44h. When trained on MFCCs, the system achieved 71.5% CER.
5.1. Importance of Pre-training
In order to determine whether pre-training of the stack of auto-encoders in front of the bottleneck was necessary, we evaluated the features generated by networks of different depth, on different amounts of data, and with as well as without unsupervised pre-training.
Table 1 provides a listing of all those experiments. When train-
ing networks on the 44 hours of Cantonese data used in the develop-
ment system, it can be seen that if pre-training is applied, recogni-
tion performance improves as more auto-encoder layers are added.
If the neural network is trained starting from the usual random ini-
tialization, the character error rate increases significantly for deeper networks.
The same experiment was repeated on the development system
with networks trained on the full Cantonese babel101-v0.4c dataset.
As before, pre-training turned out to be essential for the network to
benefit from additional auto-encoder layers. However, the difference
between DBNFs from pre-trained and purely supervised trained net-
works was smaller in terms of CER. This may indicate that unsuper-
vised, layer-wise pre-training is particularly helpful if little data is
available, and that this effect decreases with large amounts of data.
Similarly, on the 300h Switchboard benchmark, Seide et al. found that pre-training lowers the final error rate, but only by a small margin.
5.2. Usage of Unlabeled Data
Since unsupervised pre-training was found to be an essential part in
achieving good recognition performance, we investigated whether
the network could make use of additional, unlabeled data. Especially for languages with limited resources, transcribing speech data is expensive and error-prone, and being able to increase recognition performance by
using unlabeled data would be highly beneficial.
For Cantonese, we used the full 80 hours of data from babel101-
v0.4c for pre-training the auto-encoders but fine-tuned the network on the 44 hours of babel101-v0.1d only. As shown in Table 1, this resulted in additional, small improvements of up to 0.75% relative over DBNFs from networks trained on the babel101-v0.1d data alone.
On Tagalog, we used the existing separation between the full
babel106-v0.2f and the babel106b-v0.2g-sub-train version, so the
ratio of unlabeled to labeled data was almost 5 times as high as
for the experiment on Cantonese. As shown in the two last rows of
Table 3, the word error rate could be lowered by 1.3% relative (for
the BMMI-SAT system) by pre-training the network on the full 69 hours of data.
5.3. Evaluation on Larger Datasets
Further evaluation of the proposed architecture was done on the full
datasets described in section 4.4. For the Cantonese babel101-v0.4c
dataset consisting of 80 hours, the best result could be achieved with
a network containing 4 pre-trained auto-encoder layers. As shown
in Table 2, the character error rate could be reduced from 66.4% to
60.3%, which corresponds to 9.2% relative improvement.
           MFCC    DBNF (by AE layers)
                     3      4      5      6
  CER (%)  66.4   63.6   60.3   60.5   61.1
Table 2. Character error rates of the Cantonese babel101-0.4c sys-
tem with MFCC and DBNF input features.
The Tagalog system used a speaker-adapted acoustic model that
was discriminatively trained, so we were interested in whether the
performance gains on the ML systems would persist in a more ad-
vanced setting. On this task, networks containing 5 auto-encoder
layers were found to extract the best features, so this architecture
was used for subsequent experiments. The top of Table 3 shows
that with standard ML training, an improvement of over 12% rela-
tive could be achieved when using DBNFs, lowering the WER from
70.1% to 61.5%. With BMMI/SAT, the relative gain between 64.3%
WER on MFCCs and 58.3% WER on DBNFs was 9.3%, which is
still high but lower than the one achieved on the ML system. Similar
results were obtained on the babel106b-v0.2g-sub-train subset.
By performing a confusion-network combination of the MFCC
and DBNF systems trained with BMMI/SAT, the recognition accu-
racy could be raised to 56.6% WER for the full system and 66.9%
WER for the system trained on the subset.
Table 3. Results with deep bottleneck features from a 5-auto-encoder network on the Tagalog babel106-v0.2f (full, 69h) and babel106b-v0.2g-sub-train (limited, 9h) systems (in % WER). The DBNF-pre-full network was pre-trained on the full dataset and fine-tuned on the limited one.
On Switchboard, we trained networks with 4 auto-encoders
on subsets of the data and used the bottleneck features to set up a
context-dependent system on the full 300 hour dataset. Using the
features from a network trained on 60 hours of speech lowered the
word error rate from 39.0% to 36.1% (7.4% relative). Doubling the
amount of training data for the neural network resulted in further im-
provement and produced a WER of 35.6%, which is an 8.7% relative
gain over the MFCC baseline.
6. CONCLUSION
In this work, we have proposed a new setup for extracting bottle-
neck features from deep neural networks and have shown its ability
to achieve significant improvements over a number of MFCC base-
line systems on different datasets. The model used is able to produce
further gains by pre-training the stack of auto-encoders on more un-
labeled data. This turned out to be more useful if only very little
labeled data can be used for supervised fine-tuning and system train-
ing. We have also demonstrated that denoising auto-encoders are
applicable for modeling speech data and initializing deep networks.
The results presented support hypotheses from earlier works in
that log mel scale coefficients are more suitable input features for
deep neural networks than MFCCs and that pre-training is generally
beneficial but especially crucial when not much data is available.
Further work will deal with optimizing input feature vectors and
system combinations as it is currently being done with standard bot-
tleneck features. It might be interesting to compare denoising auto-
encoders with the more widely used RBMs for pre-training and to
fine-tune the network to predict context-dependent targets as was
done in previous works.
ACKNOWLEDGEMENTS
Supported in part by the Intelligence Advanced Research Projects
Activity (IARPA) via Department of Defense U.S. Army Research
Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The
U.S. Government is authorized to reproduce and distribute reprints
for Governmental purposes notwithstanding any copyright anno-
tation thereon. Disclaimer: The views and conclusions contained
herein are those of the authors and should not be interpreted as
necessarily representing the official policies or endorsements, either
expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

REFERENCES

[1] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, vol. 247, Springer, 1994.

[2] H. Hermansky, D. P. W. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on. IEEE, 2000, vol. 3, pp. 1635–1638.

[3] F. Grézl and P. Fousek, “Optimizing bottle-neck features for LVCSR,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 2008.

[4] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” Advances in Neural Information Processing Systems, vol. 19, pp. 153, 2007.

[6] M. A. Ranzato, F. J. Huang, Y. L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.

[7] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 3, pp. 328–339, 1989.

[8] R. P. Lippmann, “Review of neural networks for speech recognition,” Neural Computation, vol. 1, no. 1, pp. 1–38, 1989.

[9] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, 2012.

[10] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp. 437–440.

[11] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 30–42, 2012.

[12] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained deep neural networks,” in Proc. Interspeech, 2011.

[13] A. Mohamed, G. Hinton, and G. Penn, “Understanding how deep belief networks perform acoustic modelling,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4273–4276.

[14] L. Mangu, H. K. Kuo, S. Chu, B. Kingsbury, G. Saon, H. Soltau, and F. Biadsy, “The IBM 2011 GALE Arabic speech transcription system,” in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011.

[15] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Auto-encoder bottleneck features using deep belief networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4153–4156.

[16] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.

[17] M. Finke, P. Geutner, H. Hild, T. Kemp, K. Ries, and M. Westphal, “The Karlsruhe-Verbmobil speech recognition engine,” in Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. IEEE, 1997, vol. 1, pp. 83–86.

[18] H. Soltau, F. Metze, C. Fügen, and A. Waibel, “A one-pass decoder based on polymorphic linguistic context assignment,” in Automatic Speech Recognition and Understanding, 2001. ASRU’01. IEEE Workshop on. IEEE, 2001, pp. 214–217.

[19] T. Schaaf and F. Metze, “Analysis of gender normalization using MLP and VTLN features,” in Proc. Interspeech, 2010.

[20] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010, oral presentation.

[21] “IARPA, Office for Incisive Analysis, Babel Program,” babel.html, retrieved 2013-03-06.

[22] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 517–520.