Conference Paper

Reader: Speech Synthesizer and Speech Recognizer


Abstract

Text-to-speech and speech-to-text functionalities have become an integral part of our lives in this digital era. This paper proposes a system that generates audio from text and converts speech captured from the user's microphone into text. The system is implemented with various user-friendly features. Its target audience is people with disabilities such as dyslexia, reading challenges, or visual impairment that make reading or writing difficult. The system can aid them in using technology, and its simplicity and ease of use set it apart from existing systems.


Thesis
*In partial fulfilment for the award of the degree of Bachelor of Computer Applications.* This research proposes a classification of recommender systems based on the filtering techniques used to refine recommendations. The aim is to explore different types of filtering techniques, use them to classify recommender systems with justifications, set out their advantages and disadvantages, and compare the techniques in terms of input collection, data processing, and the recommendations produced as output. The classification rests on how a system treats data, how the data are collected and processed, and the area of application. The research classifies recommender systems by filtering technique, viz. collaborative, content-based, demographic, and knowledge-based filtering, and represents these techniques diagrammatically throughout the chapters to best explain the scenarios in which each can be helpful. It also discusses the shortfalls of these techniques in their respective areas, and where and how each overcomes the limitations of the others.
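The collaborative filtering technique discussed above can be sketched minimally as user-based neighbourhood recommendation; the toy ratings data and the choice of cosine similarity are illustrative assumptions, not taken from the thesis.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(ratings, target, k=2):
    """User-based collaborative filtering: score items the target has
    not rated by the similarity-weighted ratings of the k nearest
    neighbours, and return them best-first."""
    sims = {u: cosine(ratings[target], r)
            for u, r in ratings.items() if u != target}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    scores = {}
    for u in neighbours:
        for item, r in ratings[u].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sims[u] * r
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ratings matrix for illustration only.
ratings = {
    "alice": {"book_a": 5, "book_b": 3},
    "bob":   {"book_a": 4, "book_b": 3, "book_c": 5},
    "carol": {"book_b": 1, "book_d": 4},
}
print(recommend(ratings, "alice"))
```

Content-based or demographic filtering would replace the user-user similarity with item-feature or user-attribute similarity, but the scoring skeleton stays the same.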
Article
The text-to-speech (TTS) system, also known as a speech synthesizer, has become one of the most important technologies of recent years owing to its expanding field of applications. Several speech synthesizers have been built for English and French, whereas many other languages, including Arabic, have only recently been taken into consideration. Arabic speech synthesis has not made sufficient progress; it is still at an early stage, with low speech quality. Speech synthesis systems face several problems (e.g. speech quality, articulatory effects), and different methods have been proposed to address them, such as the use of large and varied unit sizes; this method is mainly implemented within the concatenative approach to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on a statistical parametric approach and non-uniform-unit speech synthesis. The system includes a diacritization engine: modern Arabic text is written without the vowel signs, also called diacritic marks, yet these marks are essential for determining the correct pronunciation of the text, which explains the engine's incorporation into the system. We propose a simple approach based on deep neural networks, which are trained to directly predict the diacritic marks and the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked-neural-network approach to improve the accuracy of the acoustic models. Experimental results show that the diacritization system generates fully diacritized text with high precision and that the synthesis system produces high-quality speech.
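The diacritization task described above can be sketched as sequence labeling: each character position receives a predicted vowel mark. In this toy sketch a lookup table stands in for the deep neural network, and the romanized word and mark inventory are illustrative assumptions only.

```python
# Buckwalter-style romanized placeholders for the short vowels:
# 'a' = fatha, 'u' = damma, 'i' = kasra. This table stands in for
# the trained DNN's (character, position) -> mark predictions.
PREDICTED = {("k", 0): "a", ("t", 1): "a", ("b", 2): "a"}

def diacritize(word):
    """Insert the predicted mark after each character, when the
    stand-in model has a prediction for that (char, position)."""
    out = []
    for pos, ch in enumerate(word):
        out.append(ch)
        mark = PREDICTED.get((ch, pos))
        if mark:
            out.append(mark)
    return "".join(out)

print(diacritize("ktb"))  # romanized consonant skeleton of 'he wrote'
```

In the paper's pipeline the restored marks then feed the synthesis stage, since pronunciation cannot be derived from the bare consonant skeleton.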
Conference Paper
The paper presents a system for speech recognition and synthesis for the Kazakh and Russian languages. It is designed for use by speakers of Kazakh; due to the prevalence of bilingualism among Kazakh speakers, it was considered essential to design a bilingual Kazakh-Russian system. Developing our system involved building a text processing and transcription system that deals with both Kazakh and Russian text, and is used in both speech synthesis and recognition applications. We created a Kazakh TTS voice and an additional Russian voice using the recordings of the same bilingual voice artist. A Kazakh speech database was collected and used to train deep neural network acoustic models for the speech recognition system. The resulting models demonstrated sufficient performance for practical applications in interactive voice response and keyword spotting scenarios.
Article
This study examines optimal conversions of speech sounds to audible electric currents in cochlear-implant listeners. The speech dynamic range was measured for 20 consonants and 12 vowels spoken by five female and five male talkers. Even when the maximal root-mean-square (rms) level was normalized for all phoneme tokens, both broadband and narrow-band acoustic analyses showed an approximately 50-dB distribution of speech envelope levels. Phoneme recognition was also obtained in ten CLARION implant users as a function of the input dynamic range from 10 to 80 dB in 10-dB steps. Acoustic amplitudes within a specified input dynamic range were logarithmically mapped into the 10-20-dB range of electric stimulation typically found in cochlear-implant users. Consistent with acoustic data, the perceptual data showed that a 50-60-dB input dynamic range produced optimal speech recognition in these implant users. The present results indicate that speech dynamic range is much greater than the commonly assumed 30-dB range. A new amplitude mapping strategy, based on envelope distribution differences between consonants and vowels, is proposed to optimize acoustic-to-electric mapping of speech sounds. This new strategy will use a logarithmic map for low-frequency channels and a more compressive map for high-frequency channels, and may improve overall speech recognition for cochlear-implant users.
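The logarithmic acoustic-to-electric mapping examined above can be sketched as follows. Mapping logarithmically in amplitude means mapping linearly in dB; here the output is expressed as a fraction of the electric dynamic range rather than in device units, and the input floor and range are illustrative assumptions.

```python
def map_acoustic_to_electric(a_db, in_floor=-60.0, in_dr=60.0):
    """Map an acoustic level (dB re. full scale) logarithmically in
    amplitude -- i.e. linearly in dB -- onto the electric dynamic
    range, returned as a fraction between threshold (0.0) and the
    most comfortable level (1.0). Levels outside the input dynamic
    range are clamped. The -60 dB floor and 60 dB range are
    illustrative assumptions, not the study's fitted values."""
    clamped = min(max(a_db, in_floor), in_floor + in_dr)
    return (clamped - in_floor) / in_dr

print(map_acoustic_to_electric(-30.0))  # midpoint of the input range
```

The study's proposed refinement would replace this single map with a per-channel choice: logarithmic for low-frequency channels and more compressive for high-frequency ones.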
Conference Paper
Voice assistant technology has expanded the design space for voice-activated consumer products and audio-centric user experiences. To navigate this emerging design space, the Speech Synthesis Markup Language (SSML) provides a standard for characterizing synthetic speech through parametric control of the prosody elements, i.e. pitch, rate, volume, contour, range, and duration. However, existing voice assistants utilizing Text-to-Speech (TTS) lack expressiveness. The need for a new production workflow that produces emotional audio content with TTS more efficiently is discussed. A prototype that allows a user to produce TTS-based content in any emotional tone using voice input is presented. To evaluate the workflow the prototype enables, an initial comparative study is conducted against the parametric approach. Preliminary quantitative and qualitative results suggest the new workflow is more efficient, based on time to complete tasks and number of design iterations, while maintaining the same level of user-preferred production quality.
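The parametric prosody controls that SSML standardizes can be illustrated by generating a small SSML fragment programmatically; the specific attribute values below are arbitrary examples, not settings from the paper.

```python
from xml.sax.saxutils import escape

def ssml_prosody(text, pitch="+10%", rate="medium", volume="loud"):
    """Wrap text in an SSML <prosody> element inside <speak>.
    pitch, rate, and volume are three of the prosody attributes the
    SSML standard defines; the default values here are examples."""
    return ('<speak><prosody pitch="{}" rate="{}" volume="{}">{}'
            '</prosody></speak>').format(pitch, rate, volume, escape(text))

print(ssml_prosody("Good news, everyone!"))
```

Hand-tuning such attribute values per utterance is exactly the parametric workflow the paper's voice-input prototype aims to replace.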
Conference Paper
Grapheme-to-phoneme conversion modules are essential components of text-to-speech (TTS) systems; they operate before the phone sequence is fed into the synthesis routine. Additional challenges emerge, however, when such conversion modules are implemented for non-native varieties of a language, such as Indian English. Because many of the existing grapheme-to-phoneme dictionaries represent American English pronunciation, they are not suitable for use in Indian English TTS systems. Hence, in this work, an effort has been made to modify an existing English grapheme-to-phoneme dictionary by implementing specific rules for one particular variety of Indian English, namely Assamese English. The proposed dictionary modification is applied at the front end of an Indian English TTS developed under both the unit-selection and the statistical parametric speech synthesis frameworks. In both frameworks, significant improvement in subjective evaluation is achieved when the dictionary is adapted to Assamese English pronunciation. The word error rate decreased from 46.67% to 7.69% after incorporating the variety-specific modifications to the dictionary, indicating significant perceptual improvement.
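The dictionary-adaptation idea can be sketched as a rule pass over base pronunciations. The ARPAbet-style entries and the single substitution rule below are hypothetical illustrations, not the paper's actual Assamese English rules.

```python
# Base dictionary entries in an ARPAbet-style notation (hypothetical).
BASE_DICT = {
    "water": ["W", "AO", "T", "ER"],
    "data":  ["D", "EY", "T", "AH"],
}

def adapt(pron):
    """Apply a hypothetical variety-specific rule: the rhotic vowel
    'ER' is realised as the two phones 'AX' + 'R'."""
    out = []
    for ph in pron:
        if ph == "ER":
            out.extend(["AX", "R"])
        else:
            out.append(ph)
    return out

# Rebuild the dictionary with the variety-specific pronunciations.
ADAPTED_DICT = {w: adapt(p) for w, p in BASE_DICT.items()}
print(ADAPTED_DICT["water"])
```

A real adaptation would chain many such rules (vowel substitutions, deaspiration, stress changes) before the adapted dictionary is plugged into the TTS front end.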
Conference Paper
Nowadays, English text-to-speech (TTS) engines are available both online and offline, yet very little work has been done on Devanagari TTS systems, especially for Marathi text. This paper proposes a new method for a Devanagari (Marathi) TTS system, built on an existing English TTS engine using a mapper and a combiner. Marathi input text is matched against text in a database using a simple linear-search algorithm, and the result is provided as input to the existing English TTS engine. Concatenative speech synthesis is currently the method most used in TTS systems; its main drawbacks, such as glitches, reverberation, spectral mismatch, and the requirement for a huge database, are removed to a great extent here. The proposed system thus introduces a concept that is more feasible and easier than earlier methods for Marathi TTS, and it provides maximum accuracy for text mapping.
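The mapper/combiner idea can be sketched as follows: each Marathi token is looked up by linear search in a database that maps it to a phonetic English spelling, and the combiner joins the results for the English TTS engine. The database entries and romanized outputs are illustrative assumptions.

```python
# Tiny stand-in for the Marathi-to-phonetic-English database.
DATABASE = [("नमस्कार", "namaskaar"), ("पुणे", "pune")]

def mapper(token):
    """Linear search of the database, as in the proposed system;
    unknown tokens fall through unchanged."""
    for marathi, english in DATABASE:
        if marathi == token:
            return english
    return token

def combiner(text):
    """Join the mapped tokens into one string for the English TTS."""
    return " ".join(mapper(t) for t in text.split())

print(combiner("नमस्कार पुणे"))
```

The combined string would then be passed verbatim to the existing English TTS engine, which is what lets the approach avoid building a Marathi unit database.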
Article
Text-to-speech (TTS) systems are used invariably as part of our daily lives and have come a long way. This paper presents a TTS system using concatenative synthesis based on an SDK (software development kit) platform. The system is compatible with both computers and mobile devices and has a user-friendly GUI (graphical user interface) for controlling various speech parameters. The produced speech signal can be saved and listened to whenever required, and signal analysis of the output speech can also be performed with the system. The results of this analysis, along with the stored speech signal, can be used for further applications depending on the requirements. The system is intelligent and able to overcome various normalization problems.
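One class of normalization problem such a TTS front end must handle is expanding digits and abbreviations into speakable words. A minimal sketch, with an assumed abbreviation table; digits are read out one at a time for simplicity:

```python
import re

# Hypothetical abbreviation table; a production system would also
# resolve ambiguous cases ("Dr" as "doctor" vs. "drive") in context.
ABBREV = {"dr": "doctor", "st": "street", "etc": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Lowercase words, expand known abbreviations, and spell out
    each digit individually; punctuation is dropped."""
    words = []
    for tok in re.findall(r"[A-Za-z]+|\d", text):
        if tok.isdigit():
            words.append(DIGITS[int(tok)])
        else:
            words.append(ABBREV.get(tok.lower(), tok.lower()))
    return " ".join(words)

print(normalize("Dr Smith lives at 4 Elm St."))
```

The normalized word sequence is what the concatenative back end would then turn into a unit sequence.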
Conference Paper
A real-time phonetic voice synthesizer roughly the size of a small hi-fi amplifier has been developed. It accepts a string of phoneme commands, each consisting of 8 bits: 6 bits determine the phoneme uttered, while 2 bits determine the inflection associated with that phoneme. The synthesizer contains an active filter network that simulates the transfer function of the human vocal tract. This analog network is excited by both voicing and fricative sound sources. The sound sources and the vocal-tract filter transfer function are dynamically manipulated in response to the phoneme command sequences to produce articulatory synthesis by rule.
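The 8-bit phoneme command format can be sketched as simple bit packing. The paper does not specify the bit layout, so placing the phoneme code in the low 6 bits and the inflection in the high 2 bits is an assumption for illustration.

```python
def pack_command(phoneme, inflection):
    """Pack one 8-bit phoneme command: the low 6 bits select one of
    64 phonemes, the high 2 bits one of 4 inflections (assumed layout)."""
    assert 0 <= phoneme < 64 and 0 <= inflection < 4
    return (inflection << 6) | phoneme

def unpack_command(byte):
    """Recover (phoneme, inflection) from a packed command byte."""
    return byte & 0x3F, (byte >> 6) & 0x03

cmd = pack_command(phoneme=42, inflection=2)
print(cmd, unpack_command(cmd))
```

A command string for the synthesizer would then simply be a byte sequence of such packed values.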