Source publication
The problem of designing a speech corpus for human-computer interfaces working in voice recognition and synthesis modes is investigated. The specific requirements that speech recognizers and synthesizers place on a speech corpus are emphasized. It is argued that, to meet these requirements, the corpus has to consist of two parts. One part of...
Context in source publication
Context 1
... the process of developing Part 1 is presented in Fig. 1. It is similar to the processes followed by many corpus developers, and in particular to the process given in Kazlauskienė and Raškinis (2013). What distinguishes the presented process is the extent of handwork. The process depicts the main phases in sequential order; in practice, all the phases are interlaced; different stages ...
Similar publications
Machine learning is used nowadays in several fields, one of them being speech recognition. Whether we are controlling devices with our speech, recognising the music we hear, or creating subtitles for a video based on its audio, we use more and more features related to speech and voice recognition. In general, we are glad to use such technology, in those cases when it pr...
In recent years, numerous approaches have been developed in the discipline of human voice recognition for building speaker identification systems. Frequency- and time-domain techniques are widely used for extracting human voice features. This paper presents the most robust and most popular Mel-frequency cepstral coefficient (MFCC) techniq...
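As an illustration of the MFCC feature-extraction step mentioned above, here is a minimal sketch using the librosa library; the file name and frame parameters are assumptions for demonstration, not values taken from the paper.

```python
# Minimal MFCC extraction sketch (assumes librosa is installed;
# "speaker.wav" and all parameter values are illustrative).
import librosa

# Load the recording, resampled to 16 kHz mono.
signal, sr = librosa.load("speaker.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per 25 ms frame, 10 ms hop.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

print(mfcc.shape)  # (13, number_of_frames)
```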
Automatic voice recognition systems aim to limit fraudulent access to sensitive areas such as labs. The primary objective of this paper is to increase the accuracy of the voice recognition of the Microsoft Research (MSR) Identity Toolbox in noisy environments. The proposed system enables the user to speak into the microphone; it then matches the unknown vo...
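The matching step described above can be illustrated generically. The sketch below uses time-averaged MFCC embeddings compared by cosine similarity as a simple stand-in for the MSR Identity Toolbox's actual scoring (which is not reproduced here); all file and speaker names are hypothetical.

```python
# Illustrative voice-matching sketch (not the MSR Identity Toolbox API):
# each speaker is represented by a mean-MFCC embedding, and the unknown
# voice is assigned to the closest enrolled speaker by cosine similarity.
import numpy as np
import librosa

def embed(path: str) -> np.ndarray:
    """Crude fixed-size voice embedding: time-averaged MFCC vector."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# File names below are hypothetical.
enrolled = {"alice": embed("alice.wav"), "bob": embed("bob.wav")}
unknown = embed("unknown.wav")
best = max(enrolled, key=lambda name: cosine(enrolled[name], unknown))
print("closest enrolled speaker:", best)
```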
Many applications have focused on discovering the characteristics of the face and recognizing faces, and this research has been used for different purposes in daily life; the characteristics of the voice, however, have not been sufficiently attractive to researchers because of the many difficulties involved. In this paper, we present a new method for discovering the...
The purpose of this study is to use different deep learning models for the classification of 14 different animals. Deep learning, an area of artificial intelligence, has been used in a wide range of fields in recent years, especially at an advanced level in image processing, voice recognition, and natural language processing. One of the most imp...
Citations
... For testing, 15 data files were taken from the Liepa project (Laurinčiukaitė et al., 2018), and 2 more were taken from non-public sources. The files were selected to contain approximately equal amounts of read and spontaneous speech and of male and female voices. ...
For more than two decades, Lithuanian speech recognition was researched solely in Lithuania due to the need for deep knowledge of Lithuanian. AI advancements now allow high-quality speech-to-text systems to be built without native knowledge, provided sufficient annotated data is available. This study evaluated as many as 18 Lithuanian speech transcribers using a small sample of recordings; the 7 best were selected and evaluated using extensive data. The top system achieved a WER of 5.1% for Lithuanian words, with three others showing 8.7–9.2%. For other word-sized tokens, such as numbers, speech disfluencies, abbreviations, and foreign words, a classification adapted to the Lithuanian language was proposed. Different processing strategies for tokens of these classes were examined, and it was assessed which transcribers tend to follow which strategies.
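Since WER is the headline metric in this evaluation, a short reference implementation may be useful. This is the standard word-level Levenshtein computation, not code from the study itself; the sample sentences are hypothetical.

```python
# Word error rate via word-level Levenshtein distance:
# WER = (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("labas rytas visiems", "labas ritas"))  # 2 errors / 3 words
```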
... extremely low-resource languages), similar to the previously mentioned approaches with an English initial model. Despite the fact that the Lithuanian corpus Liepa contains over 100 hours of speech (Laurinčiukaitė et al. 2018), the experiments in the paper were carried out with 3.7 minutes and 1.29 hours of speech. The naturalness score of the Lithuanian speech for the model based on 1.29 hours is 3.65, compared with 4.01 for natural speech. ...
As digital communication becomes increasingly prevalent, the development of speech synthesis systems for Croatian and related languages is of paramount importance. This paper provides an in-depth exploration of the field of speech synthesis, emphasizing the Croatian language. It chronologically charts the evolution of speech synthesis from its mechanical inception to the modern electronic age, culminating in an analysis of the contemporary landscape of digital speech synthesis systems. The study commences with a synthesis of previous research on Croatian speech synthesis, scrutinizing the methodologies and strategies implemented and evaluating their effectiveness, constraints, and results. A comparative study is also presented, assessing advancements in related Slavic languages, including Serbian, Slovene, Bosnian, and Macedonian. The discourse then widens to include the global landscape of speech synthesis. It highlights the latest breakthroughs, particularly cutting-edge techniques, frameworks, and algorithms that have yielded significant outcomes in languages with abundant linguistic resources, such as English and Mandarin Chinese. This comparison elucidates the notable gaps in speech synthesis progress on a global scale. The paper also addresses the challenges posed by the scarce and suboptimal-quality digital linguistic resources available in Croatia, which hinder the development of speech synthesis. In response to these challenges, the paper introduces a doctoral thesis dedicated to creating an annotated corpus and formulating deep learning models specifically tailored for Croatian speech synthesis. The ambition of this scholarly work is to catalyze advancement, remedy existing shortcomings, and pave the way for a robust future for Croatian speech synthesis technology. In conclusion, this survey examines both the historical trajectory and the present state of speech synthesis in Croatian. It underscores the criticality of ongoing research in this area and the urgent need for enhanced linguistic resources and innovative methodologies. The paper also briefly touches upon the significant progress in speech synthesis for globally dominant languages, such as English and Mandarin Chinese, providing a benchmark for future investigations.
... The humanoid robot communicating with the child educates the child and develops the child's skills by teaching what decisions the child must make when moving an object Q from state A, through environment S, to state B with a minimum of effort, time, material costs, and resources. The child's education is achieved by communicating with the humanoid robot through a Lithuanian speech recognition and synthesis engine (Laurinčiukaitė et al., 2018). ...
... It is generally accepted that intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans. The elements of verbal HRI using the Lithuanian language are: A, a Lithuanian speech recognizer (Greibus et al., 2017), which recognizes speech and converts it into commands, codes, and symbols understandable to the humanoid robot; and S, a Lithuanian speech synthesizer (Laurinčiukaitė et al., 2018), which gives suggestions or advice verbally in Lithuanian. A and S are installed in the hardware located in the head of the humanoid robot. ...
... Stimuli for the tasks were synthesized using the Merlin toolkit with deep neural network models adapted for the Lithuanian language [18]. The Lithuanian language corpus LIEPA, created at Vilnius University, was used to train the model [19]. Three quality levels were obtained for the synthetic stimuli by using different amounts of training data: low quality (400 training sentences), medium quality (800 training sentences), and high quality (1600 training sentences). ...
... Data for training were taken from the project LIEPA (Laurinčiukaitė et al., 2018). One of the aims of the LIEPA project was to create speech corpora for the development of speech recognition and synthesis. ...
The present paper deals with choosing the input alphabet for the end-to-end synthesizer of the Lithuanian language. Tacotron 2 is a state-of-the-art end-to-end speech synthesis model. Characters, phonemes or their combinations can be used as an input of the model. The model was trained on Lithuanian speech recordings using the following five input alphabets: letters, lowercase letters, accented letters, reduced set of accented letters, letters with separate accent marks. Acceptability of the synthesized speech was evaluated on the basis of human listeners’ subjective judgment. Experimental testing showed that accent marks significantly improved the quality of the synthesized speech. Reducing the size of the input alphabet also has a slight positive impact. Putting accent marks into the text produced the best results as compared to using the accented letters.
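The input-alphabet variants compared above can be made concrete with Unicode normalization: NFD decomposition turns a precomposed accented letter into a base letter followed by a separate combining accent mark, which corresponds to the "letters with separate accent marks" alphabet. The example word is hypothetical and this mapping is only a sketch of the idea, not the paper's exact preprocessing.

```python
# Sketch of building three of the five input alphabets compared above:
# accented letters, letters with separate accent marks, lowercase letters.
import unicodedata

text = "nãmas"  # hypothetical accented Lithuanian word

# "Accented letters": NFC keeps each accented letter as one symbol.
accented = list(unicodedata.normalize("NFC", text))
print(accented)    # ['n', 'ã', 'm', 'a', 's']

# "Letters with separate accent marks": NFD splits every accented
# letter into a base letter plus a combining accent-mark symbol.
separated = list(unicodedata.normalize("NFD", text))
print(separated)   # ['n', 'a', '\u0303', 'm', 'a', 's']

# "Lowercase letters": case folding shrinks the alphabet further.
lowercase = list(text.lower())
```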
... The English competing speech contained sentences from the IEEE lists [10] produced by a native British English male speaker [11]. The Lithuanian competing speech consisted of text read aloud by a native Lithuanian male speaker from the LIEPA corpus [12]. Neither of these speakers wore a face mask. ...
This paper examines the perception of speech produced with face masks in everyday multi-talker environments. Three groups of participants listened to English target sentences produced with or without a face mask in the presence of English or Lithuanian competing speech. Participants were monolingual English listeners, and second language English listeners with either Lithuanian or Mandarin Chinese as first language (L1). Lithuanian listeners also completed the experiment with Lithuanian target sentences. Participants were generally more accurate perceiving sentences produced without a face mask, and when listening in L1. Competing speech in a language matching the target lowered perception accuracy. Exceptionally, only when Lithuanian participants (with both English and Lithuanian knowledge) listened for Lithuanian targets was there no added challenge from matching language of target and competing speech. We conclude that acoustic distortions from face masks present an across-the-board difficulty while linguistic knowledge can reduce distraction from competing speech.
... Laurinčiukaitė et al. analyzed existing processing algorithms, adopted a method that combines well with statistical models, and implemented it. Through experiments, we compare the tagging accuracy of the system before and after adding the new-word processing algorithm [17]. ...
To study the influence of conventional literature on foreign literature driven by big data, this essay begins with surveys and interviews. As is widely known, Chinese big-data-driven corpora are distinct from other Chinese corpora. The main objective is to categorize professional corpora whose domain is unknown. To provide a straightforward and useful domain-partitioning model for corpus texts, this research makes use of text clustering and big-data-driven methodologies. Knowing the domain of the aligned text makes it easier to do machine translation research in the future. The research findings demonstrate that the accuracy rate of the approach suggested in this article is essentially above 89.79%, demonstrating in the experiment the viability of the proposed method of automatically building a corpus.
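The text-clustering step behind such domain partitioning can be sketched with scikit-learn; the corpus snippets, cluster count, and parameters below are illustrative assumptions, not the paper's configuration.

```python
# Illustrative domain partitioning of corpus texts:
# TF-IDF vectors clustered with k-means (scikit-learn assumed installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical corpus snippets from two rough domains.
texts = [
    "the patient was given a dose of antibiotics",
    "the court ruled the contract void",
    "symptoms improved after the treatment",
    "the defendant appealed the verdict",
]

vectors = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(texts, labels):
    print(label, text)  # texts sharing a label fall in the same domain
```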
... Global services, operating in a cross-linguistic environment, bring new challenges to speech emotion recognition (SER) tasks. Different cultural contexts, divergent acoustic properties and their variation across languages, and the lack of multilingual or cross-language datasets make the task of cross-linguistic SER non-trivial, and progress in this field is below expectations [7]. ...
In this research, a study of cross-linguistic speech emotion recognition is performed. For this purpose, emotional data in different languages (English, Lithuanian, German, Spanish, Serbian, and Polish) were collected, resulting in a cross-linguistic speech emotion dataset of more than 10,000 emotional utterances. Despite the bi-modal character of the databases gathered, our focus is on the acoustic representation only. The assumption is that the speech audio signal carries sufficient emotional information to detect and retrieve it. Several two-dimensional acoustic feature spaces, such as cochleagrams, spectrograms, mel-cepstrograms, and a fractal-dimension-based space, are employed as representations of speech emotional features. A convolutional neural network (CNN) is used as the classifier. The results show the superiority of cochleagrams over the other feature spaces utilized. In the CNN-based speaker-independent cross-linguistic speech emotion recognition (SER) experiment, an accuracy of over 90% is achieved, which is close to the monolingual case of SER.
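To make the feature-space-to-CNN pipeline concrete, here is a minimal sketch of one such path: a mel spectrogram (used as a stand-in for the cochleagram, which would require a gammatone filterbank) fed to a small CNN. The file name, layer sizes, and emotion count are assumptions, not the paper's architecture.

```python
# Minimal SER-style pipeline sketch: 2-D acoustic feature map -> small CNN.
# librosa and PyTorch assumed installed; all sizes are illustrative.
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
mel = librosa.power_to_db(mel)
# Pad/crop to a fixed number of frames so batches share one shape.
frames = 128
mel = librosa.util.fix_length(mel, size=frames, axis=1)

class EmotionCNN(nn.Module):
    def __init__(self, n_emotions: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Two 2x2 poolings shrink each spatial dimension by a factor of 4.
        self.classifier = nn.Linear(32 * (64 // 4) * (frames // 4), n_emotions)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = EmotionCNN()
batch = torch.tensor(mel, dtype=torch.float32)[None, None]  # (1, 1, 64, 128)
print(model(batch).shape)  # (1, 6) emotion logits
```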
... The data setting in Lithuanian is similar to that in English. We select a subset of the Liepa corpus [20] and use only the characters as the raw texts. D_h contains 50 paired text-speech samples (3.7 minutes), D_l contains 1000 paired text-speech samples (1.29 hours), Y_u^seen contains 4000 unpaired speech samples (5.1 hours), Y_u^unseen contains 5000 unpaired speech samples (6.7 hours), and X_u contains 20000 unpaired texts. ...
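The subset construction described above can be sketched as a simple partition of a shuffled corpus index. Only the subset sizes come from the quoted passage; the corpus object is a hypothetical list of (text, audio) pairs, and the seen/unseen-speaker distinction is ignored here for brevity (a real split would partition by speaker).

```python
# Sketch of carving the quoted subsets out of a corpus of >= 30050 items.
# `corpus` is a hypothetical list of (text, audio_path) pairs.
import random

def make_splits(corpus, seed=0):
    items = corpus.copy()
    random.Random(seed).shuffle(items)
    d_h = items[:50]                # high-quality paired data (3.7 min)
    d_l = items[50:1050]            # low-quality paired data (1.29 h)
    # Unpaired speech: drop the texts (speaker split omitted for brevity).
    y_u_seen = [audio for _, audio in items[1050:5050]]     # 5.1 h
    y_u_unseen = [audio for _, audio in items[5050:10050]]  # 6.7 h
    # Unpaired text: drop the audio.
    x_u = [text for text, _ in items[10050:30050]]
    return d_h, d_l, y_u_seen, y_u_unseen, x_u
```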
Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks that require a large amount of paired text and speech data for model training. However, there are more than 6,000 languages in the world, and most of them lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system for the extremely low-resource setting, which can support rare languages with low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (more than 98% intelligibility rate) and naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech, which satisfies the requirements for industrial deployment, 2) achieves promising recognition accuracy for ASR, and 3) last but not least, uses an extremely small amount of training data. We also conduct comprehensive analyses of LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
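The dual transformation idea (technique 2) can be shown structurally: ASR pseudo-transcribes unpaired speech to extend the TTS training set, while TTS synthesizes speech for unpaired text to extend the ASR training set, iterating so each model boosts the other. The trainer and model stubs below are placeholders standing in for real training code, not LRSpeech itself.

```python
# Structural sketch of dual transformation between TTS and ASR.
# `train_tts`, `train_asr`, and the model call signatures are
# hypothetical stubs, not the LRSpeech implementation.

def dual_transformation(paired, unpaired_speech, unpaired_text,
                        train_tts, train_asr, rounds=3):
    tts = train_tts(paired)  # warm start both models from paired data
    asr = train_asr(paired)
    for _ in range(rounds):
        # ASR -> TTS: pseudo-transcribe unpaired speech for TTS training.
        pseudo_for_tts = [(asr(speech), speech) for speech in unpaired_speech]
        tts = train_tts(paired + pseudo_for_tts)
        # TTS -> ASR: synthesize speech for unpaired text for ASR training.
        pseudo_for_asr = [(text, tts(text)) for text in unpaired_text]
        asr = train_asr(paired + pseudo_for_asr)
    return tts, asr

# Trivial stand-in trainer so the sketch runs end to end: every "model"
# just echoes its input, which suffices to exercise the control flow.
stub_trainer = lambda data: (lambda inp: inp)
tts, asr = dual_transformation(
    paired=[("labas", "audio_0")],
    unpaired_speech=["audio_1"], unpaired_text=["rytas"],
    train_tts=stub_trainer, train_asr=stub_trainer)
```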