Speech processing is a data-driven technology that relies on public corpora and associated resources. In contrast to languages such as English, there are few resources for Brazilian Portuguese (BP). This work describes efforts toward decreasing such gap and presents systems for speech recognition in BP using two public corpora: Spoltech and OGI-22. The following resources are made available: HTK scripts, pronunciation dictionary, language and acoustic models. The work discusses the baseline results obtained with these resources.
Spoltech and OGI-22 Baseline Systems for
Speech Recognition in Brazilian Portuguese
Nelson Neto*, Patrick Silva*, Aldebaro Klautau*, and Andre Adami¯
(*)Universidade Federal do Pará, Signal Processing Laboratory,
Rua Augusto Correa. 1, 660750110 Belém, PA, Brazil and
(¯)Universidade de Caxias do Sul,
Rua Francisco Getúlio Vargas. 1180, 95070-560 Caxias do Sul, RS, Brazil
Key words: Speech recognition, Brazilian Portuguese, HMMs, pronunciation
1 Introduction
This work discusses current efforts within the FalaBrasil initiative [1]. The overall
goal is to develop and deploy automatic speech recognition (ASR) resources and
software for BP, aiming to establish baseline systems and allow for reproducing
results across different sites. More specifically, the work presents resources and
results for two baseline systems using the Spoltech and OGI-22 corpora. All
corrected transcriptions and resources can be found in [1].
2 UFPAdic: a pronunciation dictionary for BP
In [2], UFPAdic version 1, a hand-labeled pronunciation dictionary with 11,827
words in BP, was released within the FalaBrasil initiative. The phonetic
transcriptions adopted the SAMPA alphabet and were validated by comparing results
with other publicly available pronunciation dictionaries for other languages. The
whole of UFPAdic 1 was then used to train a decision tree and, following the
procedure described in [2], a new dictionary was built by selecting the most
frequent words in the CETENFolha corpus [3]. The new dictionary, called UFPAdic 2,
has approximately 60 thousand words.
3 Building language models from CETENFolha
Several bigram language models were trained and tested using the HTK tools [4].
The models were trained using 32,100 sentences selected from the CETENFolha
and OGI-22 corpora. Vocabularies of different sizes were created by choosing
the most frequent words in the training set that were also present in UFPAdic
2. The perplexities of the bigram language models were computed using 1,000
randomly selected sentences and are shown in Table 1.
Table 1. LM perplexities for different vocabulary sizes.

Vocabulary size (thousand words)   1.5    3    6   10   15   20   30
Bigram perplexity                   47   76  113  136  149  156  165
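
The vocabulary selection and perplexity measurement described in this section can
be illustrated with a minimal sketch. It is not the HTK procedure used in this
work: it assumes add-one smoothing instead of HTK's discounting and back-off, and
the function names, the "<unk>" symbol and the sentence markers are hypothetical
choices made only for illustration.

```python
import math
from collections import Counter


def select_vocabulary(sentences, dictionary_words, size):
    """Keep the `size` most frequent training words that the dictionary covers."""
    counts = Counter(word for sentence in sentences for word in sentence)
    ranked = [w for w, _ in counts.most_common() if w in dictionary_words]
    return set(ranked[:size])


def map_oov(sentence, vocab, unk="<unk>"):
    """Replace out-of-vocabulary words with a single unknown-word symbol."""
    return [w if w in vocab else unk for w in sentence]


def train_bigram(sentences, vocab):
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + map_oov(sentence, vocab) + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams


def perplexity(sentences, vocab, histories, bigrams):
    """Perplexity under an add-one smoothed bigram model (an assumption here)."""
    num_symbols = len(vocab) + 2  # vocabulary words plus <unk> and </s>
    log_prob, num_predictions = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] + map_oov(sentence, vocab) + ["</s>"]
        for history, word in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(history, word)] + 1) / (histories[history] + num_symbols)
            log_prob += math.log(p)
            num_predictions += 1
    return math.exp(-log_prob / num_predictions)
```

With no training data the smoothed model is uniform, so the perplexity equals the
number of predictable symbols. As the model sees more word types, the perplexity
on held-out text typically grows, which matches the trend in Table 1.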
4 Front-end and acoustic modeling
The initial acoustic models for the 33 phones (32 monophones and a silence
model) used 3-state left-to-right HMMs. After that, triphone models were built
from the monophone models and a decision tree was designed for tying triphones
with similar characteristics [4]. After each step, the models were reestimated
using the Baum-Welch algorithm via HTK tools.
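
As an illustration of the 3-state left-to-right topology, the sketch below builds
a transition matrix with non-emitting entry and exit states around the three
emitting states, which is the layout HTK prototype models typically use. The
initial self-loop probability of 0.6 and the function name are assumptions made
for this example; they are not values reported in this work, and Baum-Welch
reestimation would replace them in any case.

```python
def left_to_right_transitions(num_emitting=3, self_loop=0.6):
    """Build a left-to-right transition matrix with no skip transitions.

    State 0 is a non-emitting entry state and the last state is a
    non-emitting exit state, so the matrix is (num_emitting + 2) square.
    """
    n = num_emitting + 2
    a = [[0.0] * n for _ in range(n)]
    a[0][1] = 1.0  # the entry state always moves to the first emitting state
    for i in range(1, num_emitting + 1):
        a[i][i] = self_loop            # stay in the same state
        a[i][i + 1] = 1.0 - self_loop  # advance to the next state
    return a  # the exit-state row stays all zeros


if __name__ == "__main__":
    for row in left_to_right_transitions():
        print(" ".join(f"{p:.1f}" for p in row))
```

Every emitting state can only repeat or advance, which is what makes the model
left-to-right; the same topology would be shared by the 32 monophones and the
silence model before triphone cloning and tying.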
5 OGI-22 Corpus
The 22 Language Telephone Speech Corpus [5], which includes Brazilian
Portuguese, is a corpus of spontaneous speech recorded over the telephone. In
this work, the original orthographic transcriptions were corrected, and
transcriptions were created for the files that had none. For the experiments,
the training set was composed of 2,017 files, corresponding to 184.5 minutes,
and the test set had 209 files, corresponding to 14 minutes.
6 Spoltech Corpus
The utterances from the Spoltech corpus [6] consist of both read speech and
responses to questions, from a variety of regions in Brazil. The acoustic
environment was not controlled, in order to allow for background conditions that
would occur in application environments. In the experiments, the phonetic
alphabet was the same as the one used for the OGI-22 corpus, and a pre-processing
stage removed files with poor recording quality. The training set was composed
of 5,246 files, corresponding to 180 minutes, and the test set used the
remaining 2,000 files, corresponding to 40 minutes.
7 Baseline results
The Spoltech and OGI-22 baseline systems share the same front-end. In addition,
the HMM-based acoustic models of both systems were estimated using the same
procedure described in Section 4.
7.1 Results for bigram LM obtained from the corpora transcriptions
The first experiment used an OGI-22 bigram LM with perplexity equal to 43.
The number of Gaussian components per mixture was gradually increased from
one to ten. The word error rate (WER) reduction can be observed in Fig. 1.
Similarly, a bigram LM with 793 words and perplexity 7 was designed using
only the Spoltech corpus. The respective WER results are shown in Fig. 2,
where the number of Gaussians per mixture was also varied from 1 to 10. The
WER with 10-component Gaussian mixtures is 18.6% for Spoltech and 19.92% for
OGI-22. The experiments stopped at 10-component Gaussian mixtures because the
WER stopped decreasing.
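
For reference, the WER figures above follow the usual definition: substitutions,
deletions and insertions divided by the number of reference words. The sketch
below computes it with a word-level edit distance. It is a simplified
illustration, not the HTK scoring tool used for the reported results, and the
function names and example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (word level)."""
    # d[i][j] is the distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1]


def corpus_wer(pairs):
    """WER in percent over (reference, hypothesis) sentence pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    num_ref_words = sum(len(r.split()) for r, _ in pairs)
    if num_ref_words == 0:
        raise ValueError("reference transcriptions contain no words")
    return 100.0 * errors / num_ref_words


if __name__ == "__main__":
    print(corpus_wer([("o gato preto dorme", "o gato dorme")]))
```

Errors are pooled over the whole test set before dividing, so long utterances
weigh more than short ones. Because insertions count as errors, the WER can
exceed 100%.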
Fig. 1. Decrease in WER (%) with the number of Gaussians in each mixture for OGI-22
using a simplified bigram LM with perplexity 43.
7.2 Results with language models including text from CETENFolha
Using the bigram language models described in Section 3, simulations were
performed with the acoustic model trained on the OGI-22 corpus and the
number of Gaussians per mixture set to ten. The WER for the system with
30,000 words is 35.87%. Increasing the complexity of the LM does not improve
the results, given the mismatch between the CETENFolha text and the OGI-22
sentences.
Fig. 2. WER (%) for Spoltech using a simplified bigram LM with perplexity 7.
8 Conclusions
This paper presented some baseline results for ASR in BP. The resources were
made publicly available and allow for reproducing results across different sites.
Future work should concentrate efforts on collecting a larger corpus with
broadcast news.
This work was partially supported by CNPq, Brazil, project 478022/2006-9
Reconhecimento de Voz com Suporte a Grandes Vocabulários para o Português
Brasileiro: Desenvolvimento de Recursos e Sistemas de Referência.
1. “,” Visited in April, 2008.
2. C. Hosn, L. A. N. Baptista, T. Imbiriba, and A. Klautau, “New resources for
Brazilian Portuguese: Results for grapheme-to-phoneme and phone classification,”
in VI International Telecommunications Symposium, Fortaleza, 2006.
3. “,” Visited in January, 2008.
4. S. Young, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK
Version 3.4). Cambridge University Engineering Department, 2006.
5. T. Lander, R. Cole, B. Oshika, and M. Noel, “The OGI 22 language telephone
speech corpus,” in Proc. Eurospeech’95, Madrid, 1995.
6. “Advancing human language technology in Brazil and the United States through
collaborative research on Portuguese spoken language systems,” Federal University
of Rio Grande do Sul, University of Caxias do Sul, Colorado University, and Oregon
Graduate Institute, 2001.
