
The Kestrel TTS text normalization system

Authors: Peter Ebden and Richard Sproat

Abstract

This paper describes the Kestrel text normalization system, a component of the Google text-to-speech synthesis (TTS) system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is itself not new, Kestrel differs from previous systems in its separation of the initial tokenization and classification phase of analysis from verbalization. Input text is first tokenized and the different tokens classified using WFSTs. As part of classification, detected semiotic classes – expressions such as currency amounts, dates, times, and measure phrases – are parsed into protocol buffers (https://code.google.com/p/protobuf/). The protocol buffers are then verbalized, with possible reordering of the elements, again using WFSTs. This paper describes the architecture of Kestrel and the protocol buffer representations of semiotic classes, and presents some examples of grammars for various languages. We also discuss applications and deployments of Kestrel as part of the Google TTS system, which runs both server-side and client-side on multiple devices, and is used daily by millions of people in nineteen languages and counting.
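To make the tokenize/classify vs. verbalize split concrete, here is a minimal Python sketch of the flow for a single money token. The nested dict stands in for the protocol buffer message, and its field names are hypothetical rather than Kestrel's actual message definitions; real Kestrel implements both phases with WFST grammars, not Python code.

def classify(token):
    """Phase 1 (tokenize/classify): map a written-form token to a semiotic-class
    message, approximated here by a nested dict instead of a protocol buffer."""
    if token.startswith("$"):
        whole, _, cents = token[1:].partition(".")
        return {"money": {"currency": "usd",
                          "integer_part": whole,
                          "fractional_part": cents}}
    return {"verbatim": token}

NUMBER_NAMES = {"3": "three", "50": "fifty"}  # toy number-name lookup

def verbalize(message):
    """Phase 2 (verbalize): turn the classified message into words, possibly
    reordering elements (here the currency word is moved after the amount)."""
    if "money" in message:
        m = message["money"]
        words = [NUMBER_NAMES.get(m["integer_part"], m["integer_part"]), "dollars"]
        if m["fractional_part"]:
            words += ["and", NUMBER_NAMES.get(m["fractional_part"], m["fractional_part"]), "cents"]
        return " ".join(words)
    return message["verbatim"]

print(verbalize(classify("$3.50")))  # -> "three dollars and fifty cents"
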
... This essential pre-processing step is part of ensuring that the synthesized speech is fluent, natural, and conveys the intended meaning. Traditionally, TN has used rule-based systems such as finite-state transducers (FSTs) [3,4]. This is due to two reasons: (i) the need for high precision and (ii) faster processing. ...
... For the same amount of fine-tuning data and training steps, the models using PDS are consistently better. In fact, models fine-tuned without PDS only become better than those using PDS when they are trained on at least 10 times more data and/or for double the number of steps. Table 1 presents a more dissected view of the accuracy across all different semiotic classes. (Digit sequences here are [0-9] strings up to the maximum number of digits handled by normalization, in this case one hundred quintillion, or 10^20; results for ELECTRONIC, ADDRESS, FRACTION, and TELEPHONE were excluded from the average because their datasets are highly unreliable.) ...
Preprint
Full-text available
We present a Positional Description Scheme (PDS) tailored for digit sequences, integrating placeholder value information for each digit. Given the structural limitations of subword tokenization algorithms, language models encounter critical Text Normalization (TN) challenges when handling numerical tasks. Our schema addresses this challenge through straightforward pre-processing, preserving the model architecture while significantly simplifying number normalization, rendering the problem tractable. This simplifies the task and facilitates more compact production-ready models capable of learning from smaller datasets. Furthermore, our investigations reveal that PDS enhances the arithmetic processing capabilities of language models, resulting in a relative accuracy improvement of 23% to 51% on complex arithmetic tasks. We demonstrate that PDS effectively mitigates fatal numerical normalization errors in neural models, requiring only a modest amount of training data without rule-based Finite State Transducers (FST). We demonstrate that PDS is essential for both the Text-To-Speech and Speech Recognition text processing, enabling effective TN under production constraints.
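As a rough illustration of the idea, the sketch below tags each digit of a sequence with a place-value label before the string reaches the model. The tag inventory and output format are invented for illustration; the paper's exact scheme is not reproduced here.

PLACES = ["units", "tens", "hundreds", "thousands",
          "ten_thousands", "hundred_thousands", "millions"]  # toy limit: 7 digits

def positional_description(digits: str) -> str:
    """Tag each digit with a place-value label, most significant digit first."""
    if len(digits) > len(PLACES):
        raise ValueError("toy example only handles up to 7 digits")
    tagged = [f"{d}_{PLACES[i]}" for i, d in enumerate(reversed(digits))]
    return " ".join(reversed(tagged))

print(positional_description("1234"))  # -> "1_thousands 2_hundreds 3_tens 4_units"
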
... Text normalization involves converting text into a standard form, reducing redundancy and ensuring consistency [18]. Example in Urdu: ...
Article
Full-text available
Natural Language Processing (NLP), a sub-field concerned with human languages as they are spoken and written, has gained major consideration in computer science, and the interaction between humans and computers has emerged as a field with a lot of potential. Natural Language Processing is used across all areas of modern-day systems and in all languages. Urdu, a language spoken in South Asia, has received relatively little attention from researchers compared to extensively studied languages like English. Advances in Urdu NLP, and its adoption in the applications found in modern systems, will make those systems more usable and comprehensible for Urdu language users worldwide. Companies need not only to gauge people's feelings but also to respond to them, in order to keep customers satisfied, enhance the customer base, and spread positive word of mouth. This paper presents a review of the Urdu NLP work done so far, as well as our work in the field of Urdu sentiment analysis, with a focus on its practical usage and potential applications. Additionally, it addresses the present status of Urdu NLP and offers guidance to researchers on future directions for further advancements in this field.
... Research applying ITN in the construction of speech corpora includes methods for generating both written and spoken forms by ASR systems [10] and converting from spoken to written form, considering the specific characteristics of a language [11]. Neural network-based ITN approaches are superior in learning the complex relationships between spoken and written forms with lower costs through a data-driven approach, compared to Finite State Transducer (FST)-based methods [12,13,14]. Research has been conducted to enable neural network-based ITN models to handle ASR-generated spoken-form text that includes irregular insertions of interjections or errors, by using ASR output as training data [15]. ...
Article
Text normalization (TN) is a crucial preprocessing step in text-to-speech synthesis, which pertains to the accurate pronunciation of numbers and symbols within the text. Existing neural network-based TN methods have shown significant success in rich-resource languages. However, these methods are data-driven and rely heavily on large labeled datasets, which is not practical in zero-resource settings. Rule-based weighted finite-state transducers (WFSTs) are a common approach to zero-shot TN, but WFST-based TN approaches encounter challenges with ambiguous input, particularly in cases where the normalized form is context-dependent. On the other hand, conventional neural TN methods suffer from unrecoverable errors. In this paper, we propose ZSTN, a novel zero-shot TN framework based on cross-lingual knowledge distillation, which utilizes annotated data to train the teacher model on a rich-resource language and unlabeled data to train the student model on a zero-resource language. Furthermore, it incorporates expert knowledge from WFSTs into a knowledge distillation neural network. Concretely, a TN model with WFST pseudo-label augmentation is trained as a teacher model in the source language. Subsequently, the student model is supervised by soft labels from the teacher model and WFST pseudo-labels from the target language. By leveraging cross-lingual knowledge distillation, we address contextual ambiguity in the text, while the WFST mitigates unrecoverable errors of the neural model. Additionally, ZSTN is adaptable to different zero-resource languages by using the joint loss function for the teacher model and WFST constraints. We also release a zero-shot text normalization dataset in five languages. We compare ZSTN with seven zero-shot TN benchmarks on public datasets in four languages for the teacher model and zero-shot datasets in five languages for the student model. The results demonstrate that the proposed ZSTN excels in performance without the need for labeled data. All datasets and code used for this work are released at https://github.com/wlq2019/Zero-Shot-Text-Normalization.
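The combination of teacher soft labels and WFST pseudo-labels can be pictured as a weighted sum of two terms. The sketch below is a generic, per-token distillation-style loss in plain Python; the weighting, temperature, and label encoding are illustrative assumptions, not the paper's actual joint loss.

import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)

def student_loss(student_logits, teacher_logits, wfst_label, alpha=0.5, temperature=2.0):
    """Soft term from the teacher's distribution plus a hard term from the WFST
    pseudo-label; alpha and temperature are illustrative hyperparameters."""
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    hard_target = [1.0 if i == wfst_label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(hard_target, softmax(student_logits))
    return alpha * soft + (1.0 - alpha) * hard

# Example: three candidate verbalizations for one token, WFST picks index 0.
print(student_loss([2.0, 0.5, -1.0], [1.8, 0.2, -0.9], wfst_label=0))
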
Conference Paper
Full-text available
Incorrect normalization of text can be particularly damaging for applications like text-to-speech synthesis (TTS) or typing auto-correction, where the resulting normalization is directly presented to the user, as opposed to feeding downstream applications. In this paper, we focus on abbreviation expansion for TTS, which requires a "do no harm", high-precision approach yielding few expansion errors at the cost of leaving relatively many abbreviations unexpanded. In the context of a large-scale, real-world TTS scenario, we present methods for training classifiers to establish whether a particular expansion is apt. We achieve a large increase in correct abbreviation expansion when combined with the baseline text normalization component of the TTS system, together with a substantial reduction in incorrect expansions.
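The "do no harm" requirement amounts to substituting an expansion only when a classifier is sufficiently confident. The tiny sketch below shows that decision rule; the threshold value and score scale are illustrative, not taken from the paper.

def maybe_expand(token: str, candidate: str, score: float, threshold: float = 0.95) -> str:
    """Apply the expansion only when the classifier score clears a high bar;
    otherwise keep the abbreviation unexpanded (precision over recall)."""
    return candidate if score >= threshold else token

print(maybe_expand("St.", "Saint", 0.98))   # confident  -> "Saint"
print(maybe_expand("St.", "Street", 0.60))  # uncertain  -> "St."
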
Article
Full-text available
Roark and Sproat have written a compact and very readable book surveying computational morphology and computational syntax. This text is not introductory; instead, it will help bring computational linguists who do not work on morphology or syntax up to date on these areas' latest developments. Certain chapters (in particular, Chapters 2 and 8) provide especially good starting points for advanced graduate courses or seminars. The text is divided into an Introduction and Preliminaries chapter, four chapters on computational approaches to morphology, and four chapters on computational approaches to syntax. The morphology chapters focus primarily on formal and theoretical issues, and are likely to be of interest to morphologists, computational and not. The syntax chapters are driven more by engineering goals, with more algorithmic detail. Because a good understanding of probabilistic modeling is assumed, these chapters will also be useful for machine learning researchers interested in language processing. Despite the authors' former affiliations, this book is not an AT&T analogue of Beesley and Karttunen's (2003) pedagogically motivated text on the Xerox finite-state tools. This text is not about the AT&T FSM libraries or the algorithms underlying them (cf. Roche and Schabes 1997).
Article
Full-text available
In this article we describe HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment. The decoder is implemented with standard Weighted Finite-State Transducer (WFST) operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance. The direct generation of translation lattices in the target language can improve subsequent rescoring procedures, yielding further gains when applying long-span language models and Minimum Bayes Risk decoding. We also provide insights as to how to control the size of the search space defined by hierarchical rules. We show that shallow-n grammars, low-level rule catenation, and other search constraints can help to match the power of the translation system to specific language pairs.
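As a toy picture of rescoring with WFST operations, the sketch below composes a two-candidate "translation lattice" with a hand-weighted acceptor standing in for a language model and takes the shortest path. It uses Pynini (Python bindings over OpenFst); the strings and weights are invented, and a real system would compose a full lattice with an n-gram WFST rather than a hand-built acceptor.

import pynini

# Toy "translation lattice": two candidates with costs (tropical semiring,
# lower is better); invented strings and weights, not from the paper.
lattice = pynini.union(
    pynini.accep("the house is small", weight=2.0),
    pynini.accep("the house are small", weight=1.5),
).optimize()

# Stand-in for a language model: an acceptor that penalizes the ungrammatical
# candidate; a real system would compose with an n-gram WFST.
lm = pynini.union(
    pynini.accep("the house is small", weight=0.5),
    pynini.accep("the house are small", weight=5.0),
).optimize()

# Rescoring = WFST composition followed by shortest path.
print(pynini.shortestpath(lattice @ lm).string())  # -> "the house is small"
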
Article
Text-to-Speech Synthesis provides a complete, end-to-end account of the process of generating speech by computer. Giving an in-depth explanation of all aspects of current speech synthesis technology, it assumes no specialized prior knowledge. Introductory chapters on linguistics, phonetics, signal processing and speech signals lay the foundation, with subsequent material explaining how this knowledge is put to use in building practical systems that generate speech. Alongside coverage of the very latest techniques such as unit selection, hidden Markov model synthesis, and statistical text analysis, explanations of the more traditional techniques such as formant synthesis and synthesis by rule are also provided. Weaving together the various strands of this multidisciplinary field, the book is designed for graduate students in electrical engineering, computer science, and linguistics. It is also an ideal reference for practitioners in the fields of human communication interaction and telephony.
Conference Paper
In this paper, we present a new collection of open-source software libraries that provides command line binary utilities and library classes and functions for compiling regular expression and context-sensitive rewrite rules into finite-state transducers, and for n-gram language modeling. The OpenGrm libraries use the OpenFst library to provide an efficient encoding of grammars and general algorithms for building, modifying and applying models.
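Thrax grammars are written in OpenGrm's own rule language and compiled offline; the sketch below instead uses Pynini, the OpenGrm Python library, to show a comparable context-dependent rewrite rule. The tiny alphabet and the "%"-to-"percent" rule are illustrative choices, not examples from the paper.

import pynini
from pynini.lib import rewrite

# Small illustrative alphabet; a real grammar would cover arbitrary byte strings.
chars = "abcdefghijklmnopqrstuvwxyz0123456789%. "
sigma_star = pynini.closure(pynini.union(*chars)).optimize()

# Context-dependent rewrite rule: replace "%" with " percent" in any context.
percent_rule = pynini.cdrewrite(pynini.cross("%", " percent"), "", "", sigma_star)

print(rewrite.one_top_rewrite("humidity is 85%", percent_rule))
# -> "humidity is 85 percent"
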
Conference Paper
Pushdown automata are devices that can efficiently represent context-free languages, have natural weighted versions, and combine naturally with finite automata. We describe a pushdown transducer extension to OpenFst, a weighted finite-state transducer library. We present several weighted pushdown algorithms, some with clear finite-state analogues, describe their library usage and give some applications of these methods to recognition, parsing and translation.
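As a purely conceptual picture of what the pushdown extension adds (plain Python, not the OpenFst PDT API): a pushdown device pairs designated "parenthesis" symbols using a stack, which is what lets it capture context-free nesting that a finite-state machine alone cannot.

def pdt_accepts(symbols, pairs={"(": ")", "[": "]"}):
    """Accept the input iff its paired 'parenthesis' symbols nest correctly;
    all other symbols would be handled by the ordinary finite-state part."""
    stack = []
    for s in symbols:
        if s in pairs:                  # opening parenthesis: push
            stack.append(pairs[s])
        elif s in pairs.values():       # closing parenthesis: pop and match
            if not stack or stack.pop() != s:
                return False
    return not stack

print(pdt_accepts("([()])ab"))  # -> True
print(pdt_accepts("([)]"))      # -> False
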
Book
This collection is a very welcome addition to the literature on automatic speech synthesis, also known as "text-to-speech" (TTS). It has been more than a decade since a comprehensive, edited collection of chapters on this topic has been published. Because much has changed in TTS research over the last ten years, this text will prove very useful to workers in the field. Along with Dutoit's recent book (An Introduction to Text-to-Speech Synthesis [1997], also published by Kluwer), it is essential reading for anyone serious about TTS research. (Other recent works have included chapters on TTS, but their main focus has been on other aspects of speech processing, and thus the TTS details there are far fewer. While Bailly and Benoit's collection Talking Machines [1992] has synthesis as its sole focus, it derives from numerous presentations at a workshop and suffers accordingly due to its many uneven short chapters.) A more direct comparison can be made to Allen, Hunnicutt, and Klatt's book on MITalk (1987), especially as both that book and the one currently under review describe in detail specific TTS systems developed at two of the major research centres involved in speech synthesis over the years. As the foreword of the present book notes, the MITalk system was largely based on morphological decomposition of words and synthesis via a Klatt formant architecture, while the modular Bell system is distinguished by its regular relations for text analysis, its use of concatenation of diphone units, and its emphasis on the importance of careful selection of texts and recording conditions. While they cover similar ground, the newer Sproat book is quite different from Dutoit's. It has seven authors, all contributors to the Bell Labs system that is described in detail in the book. Often in multiauthor books, one finds a significant unevenness in coverage and style across chapters due to lack of coordination among the authors. This is much less apparent in Sproat's book because all authors worked at the same lab and because one author (van Santen) was involved in seven of the chapters; at least one of the two principal authors (Sproat and van Santen) contributed to all nine chapters. Further distinguishing this book is its emphasis on multilingual TTS: the Bell system exists for ten diverse languages, and the book provides many specific examples of interesting problems in different languages. While several multilingual synthesizers are available commercially, most technical literature has focused on one language at a time. Given that all speech synthesis is based on the same human speech production mechanism and that the world's languages share many aspects of phonetics, it is prudent to examine how a uniform methodology can be applied to many different languages for TTS. Unlike speech coders, which normally function equally well for all languages without adjustment, speech recognizers and synthesizers necessarily need training for individual languages. One of the foci of this book is minimizing the work to