Spoltech and OGI-22 Baseline Systems for
Speech Recognition in Brazilian Portuguese
Nelson Neto*, Patrick Silva*, Aldebaro Klautau*, and Andre Adami¯
(*)Universidade Federal do Pará, Signal Processing Laboratory,
Rua Augusto Corrêa, 1, 66075-110 Belém, PA, Brazil and
(¯)Universidade de Caxias do Sul,
Rua Francisco Getúlio Vargas, 1180, 95070-560 Caxias do Sul, RS, Brazil
Abstract. Speech processing is a data-driven technology that relies on
public corpora and associated resources. In contrast to languages such as
English, there are few resources for Brazilian Portuguese (BP). This work
describes efforts toward decreasing this gap and presents systems for
speech recognition in BP using two public corpora: Spoltech and OGI-22.
The following resources are made available: HTK scripts, a pronunciation
dictionary, and language and acoustic models. The work discusses the
baseline results obtained with these resources.
Key words: Speech recognition, Brazilian Portuguese, HMMs, pronunciation
1 Introduction
This work discusses current efforts within the FalaBrasil initiative [1]. The overall
goal is to develop and deploy automatic speech recognition (ASR) resources and
software for BP, aiming to establish baseline systems and allow for reproducing
results across different sites. More specifically, the work presents resources and
results for two baseline systems using the Spoltech and OGI-22 corpora. All
corrected transcriptions and resources can be found in [1].
2 UFPAdic: a pronunciation dictionary for BP
In [2], a hand-labeled pronunciation dictionary, UFPAdic version 1, with
11,827 BP words was released within the FalaBrasil initiative. The phonetic
transcriptions adopted the SAMPA alphabet and were validated by comparing
results with publicly available pronunciation dictionaries for other languages.
The entire UFPAdic 1 was used to train a decision tree and, following the
procedure described in [2], a new dictionary was built by selecting the most
frequent words in the CETENFolha corpus [3]. The new dictionary, called
UFPAdic 2, has approximately 60 thousand words.
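A full decision-tree grapheme-to-phoneme converter as in [2] is beyond a short example, but the core idea, predicting each letter's phone from its context, can be sketched as follows. The context rules and SAMPA-like phone symbols below are illustrative assumptions, not the actual UFPAdic rules:

```python
# Minimal sketch of context-dependent grapheme-to-phoneme conversion for BP.
# Real systems (as in [2]) learn a decision tree per grapheme; here a few
# hand-written context rules stand in for the learned tree.

def g2p(word):
    """Map each letter to a phone using its right-hand context."""
    w = word.lower()
    phones = []
    for i, ch in enumerate(w):
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if ch == "c":
            phones.append("s" if nxt in "ei" else "k")  # 'ce'/'ci' -> /s/
        elif ch == "h":
            continue  # 'h' alone is silent in BP (digraphs ignored here)
        else:
            phones.append(ch)  # identity fallback for this toy example
    return " ".join(phones)

print(g2p("casa"))    # k a s a
print(g2p("cidade"))  # s i d a d e
```

In a learned tree, each split would test such context questions, with leaves holding the predicted phone.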
3 Building language models from CETENFolha
Several bigram language models were trained and tested using the HTK tools [4].
The models were trained using 32,100 sentences selected from the CETENFolha
and OGI-22 corpora. Vocabularies with different sizes were created by choosing
the most frequent words in the training set, which were also present in UFPAdic
2. The perplexities of the bigram language models were computed using 1,000
randomly selected sentences and are shown in Table 1.
Table 1. LM perplexities for different vocabulary sizes.

Vocabulary size (thousand words) | 1.5 |  3 |   6 |  10 |  15 |  20 |  30
Bigram perplexity                |  47 | 76 | 113 | 136 | 149 | 156 | 165
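For readers unfamiliar with the perplexity figures in Table 1, the following sketch shows how a bigram LM can be trained and its perplexity computed on a set of sentences. It uses add-one smoothing and a toy corpus purely for illustration; the paper's models were built with the HTK tools, not this code:

```python
import math
from collections import Counter

# Toy bigram LM with add-one smoothing; lower perplexity means the LM
# predicts the evaluation text better.

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def perplexity(sentences, model):
    unigrams, bigrams, V = model
    log_prob, n_tokens = 0.0, 0
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for prev, cur in zip(toks[:-1], toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)  # add-one
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

model = train_bigram(["o menino correu", "o menino caiu"])
print(perplexity(["o menino correu"], model))
```

As Table 1 shows, perplexity grows with vocabulary size, since probability mass is spread over more candidate words.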
4 Front-end and acoustic modeling
The initial acoustic models for the 33 phones (32 monophones and a silence
model) used 3-state left-to-right HMMs. After that, triphone models were built
from the monophone models and a decision tree was designed for tying triphones
with similar characteristics [4]. After each step, the models were reestimated
using the Baum-Welch algorithm via HTK tools.
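The likelihood computation that Baum-Welch reestimation builds on can be illustrated with the forward algorithm for a 3-state left-to-right HMM. The transition and emission probabilities below are invented for illustration, and discrete emissions stand in for the Gaussian mixtures used in the actual systems:

```python
# Sketch of the forward pass for a 3-state left-to-right HMM with discrete
# emissions. Baum-Welch (as run by HTK) adds a backward pass to reestimate
# A and B; the numbers here are illustrative only.

# Transition matrix A: each state loops or advances (left-to-right topology).
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
# Emission probabilities B[state][symbol] over a 2-symbol alphabet.
B = [[0.9, 0.1],
     [0.5, 0.5],
     [0.1, 0.9]]
pi = [1.0, 0.0, 0.0]  # decoding always starts in the first state

def forward(obs):
    """Total likelihood P(obs | model) via the forward algorithm."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(3)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(3)) * B[s][o]
                 for s in range(3)]
    return sum(alpha)

print(forward([0, 0, 1, 1]))  # total likelihood, about 0.1506
```

The left-to-right topology shows up as the zeros below the diagonal of A: a state can only loop or move forward, matching the temporal structure of speech.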
5 OGI-22 Corpus
The 22 Language Telephone Speech Corpus [5], which includes Brazilian
Portuguese, consists of spontaneous speech recorded over the telephone. In
this work the original orthographic transcriptions were corrected, and the
missing ones were created. For the experiments, the training set was composed
of 2,017 files, corresponding to 184.5 minutes, and the test set had 209 files
totaling 14 minutes.
6 Spoltech Corpus
The utterances in the Spoltech corpus [6] consist of both read speech and
responses to questions from speakers of a variety of regions in Brazil. The
acoustic environment was not controlled, in order to allow for the background
conditions that would occur in application environments. In the experiments,
the phonetic alphabet was the same as the one used for the OGI-22 corpus, and
a pre-processing stage removed files with poor recording quality. The training
set was composed of 5,246 files corresponding to 180 minutes, and the test set
used the remaining 2,000 files, corresponding to 40 minutes.
7 Baseline results
The Spoltech and OGI-22 baseline systems share the same front-end. In addition,
the HMM-based acoustic models of both systems were estimated using the same
procedure described in Section 4.
7.1 Results for bigram LM obtained from the corpora transcriptions
The first experiment used an OGI-22 bigram LM with perplexity equal to 43.
The number of components in the mixture distributions was gradually increased
from one to ten. The resulting word error rate (WER) reduction can be observed
in Fig. 1.
Similarly, a bigram LM with 793 words and perplexity 7 was designed using
only the Spoltech corpus. The respective WER results are shown in Fig. 2,
where the number of Gaussians per mixture was also varied from 1 to 10. The
WER with 10-component Gaussian mixtures is 18.6% and 19.92% for Spoltech
and OGI-22, respectively. The experiments stopped at 10-component Gaussian
mixtures because the WER stopped decreasing.
Fig. 1. Decrease in WER (%) with the number of Gaussians in each mixture for OGI-22
using a simplified bigram LM with perplexity 43.
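The WER figures reported here are based on the standard edit-distance alignment between the reference and hypothesis transcriptions, which can be sketched as follows (the example sentences are illustrative):

```python
# WER sketch: Levenshtein distance between reference and hypothesis word
# sequences, divided by the reference length, as a percentage.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One substitution ("para" -> "pra") out of 5 reference words: 20% WER.
print(wer("o menino correu para casa", "o menino correu pra casa"))  # 20.0
```

Note that WER counts substitutions, insertions, and deletions alike, so it can exceed 100% when the hypothesis is much longer than the reference.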
7.2 Results with language models including text from CETENFolha
Using the bigram language models mentioned in Section 3, simulations were
performed with the acoustic model created from the OGI-22 corpus and the
number of Gaussians per mixture set to ten. The WER for the system with
30,000 words is 35.87%. It can be noticed that increasing the complexity of
the LM does not improve the results, given the mismatch between the
CETENFolha text and the OGI-22 sentences.
Fig. 2. WER (%) for Spoltech using a simplified bigram LM with perplexity 7.
8 Conclusions
This paper presented some baseline results for ASR in BP. The resources were
made publicly available and allow for reproducing results across different sites.
Future work should concentrate on collecting a larger corpus of broadcast
news.
This work was partially supported by CNPq, Brazil, project 478022/2006-9
"Reconhecimento de Voz com Suporte a Grandes Vocabulários para o Português
Brasileiro: Desenvolvimento de Recursos e Sistemas de Referência".
References

1. “,” Visited in April, 2008.
2. C. Hosn, L. A. N. Baptista, T. Imbiriba, and A. Klautau, “New resources for
Brazilian Portuguese: Results for grapheme-to-phoneme and phone
classification,” in VI International Telecommunications Symposium, Fortaleza,
2006.
3. “,” Visited in January, 2008.
4. S. Young, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK
Version 3.4). Cambridge University Engineering Department, 2006.
5. T. Lander, R. Cole, B. Oshika, and M. Noel, “The OGI 22 Language Telephone
Speech Corpus,” in Proc. Eurospeech’95, Madrid, 1995.
6. “Advancing human language technology in Brazil and the United States
through collaborative research on Portuguese spoken language systems,”
Federal University of Rio Grande do Sul, University of Caxias do Sul,
Colorado University, and Oregon Graduate Institute, 2001.