Conference Paper

The phoneme set influence for Lithuanian speech commands recognition accuracy

... It is generally accepted that intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans. The elements of verbal HRI using the Lithuanian language are: A - Lithuanian speech recognizer (Greibus et al., 2017), which recognizes speech and converts it into understandable commands, codes and symbols for the humanoid robot; S - Lithuanian speech synthesizer (Laurinčiukaitė et al., 2018), which gives suggestions or advice verbally in Lithuanian. A and S are installed in the hardware located in the head of the humanoid robot. ...
... The digital signal from the microphones can be processed internally by software modules. An additional speech recognition module was created based on an existing generic speech-to-text engine (Greibus et al., 2017) developed for the Lithuanian language. This engine, shown in Fig. 4, can be extended and used for many other languages, such as Latvian and Estonian, without additional restrictions. ...
... When the engine provides recognition results, a new event is generated by the application via the robot event management system. With this module there are infinite possibilities to create HRI scenarios (Greibus et al., 2017). ...
... Recently, many GUI-based tools have been developed for comparing files. However, little research has been done on visualizing malware apart from [2], [3]. Self-organizing maps are used for detection and visualization of malware in [3]. ...
... However, little research has been done on visualizing malware apart from [2], [3]. Self-organizing maps are used for detection and visualization of malware in [3]. A visualization framework using reverse engineering is presented in [4]. ...
... However, these strategies are not applied to malware classification. In this study, we build on the comparable work proposed in [2], [3] by viewing malware as grayscale images. Classification of malware has been an area attracting great interest. ...
Article
The anti-malware industry faces the challenge of evaluating huge quantities of data for potentially malicious content. This is because hackers introduce polymorphism into existing malicious groups/classes. Effective feature extraction and classification of malware data is essential to handle such issues. In this paper, we visualize viruses as images, since images capture minor modifications while maintaining a global structure. We then apply Principal Component Analysis (PCA) for feature extraction. Based on the extracted PCA features, we study the performance of several Artificial Neural Network (ANN) algorithms along with K-Nearest Neighbors (NN) and Support Vector Machine (SVM) classification methods for assigning malware data to their respective classes. We use k-fold validation to gauge the effectiveness of our approach. The study uses the publicly available Kaggle database provided by Microsoft for the Microsoft Malware Classification Challenge (BIG 2015).
... The considerations above imply that G2P conversion of Lithuanian is quite complex. G2P converter that relies on a word spelling and grapheme rewrite rules (Greibus et al., 2017; Lileikytė et al., 2018), henceforth referred to as a shallow G2P converter, is incapable of resolving ambiguities related to vowel duration, syllable stress, and syllable boundaries and consequently is incapable of producing detailed and consistent allophone sequences. Only G2P converter making use of supplementary pronunciation dictionaries (Skripauskas and Telksnys, 2006) or of accentuation algorithms (Norkevičius et al., 2005; Kazlauskienė et al., 2010), henceforth referred to as a knowledge-rich G2P converter, might be capable of disambiguating and modelling these phonological properties correctly. ...
... The problem of finding the best word to sub-word unit mapping for the applications of Lithuanian ASR was first addressed by Raškinis and Raškinienė (2003), followed by Šilingas (2005), Laurinčiukaitė and Lipeika (2007), Gales et al. (2015), Greibus et al. (2017), Lileikytė et al. (2018), and Ratkevicius et al. (2018). ...
... First, different proprietary speech corpora were used for ASR system training and evaluation (Laurinčiukaitė et al., 2006; Harper, 2016; Laurinčiukaitė et al., 2018). Second, ASR setups were based on different acoustic modelling techniques, such as monophone HMM system (Šilingas, 2005; Ratkevicius et al., 2018), triphone HMM system (Raškinis and Raškinienė, 2003; Šilingas, 2005; Laurinčiukaitė, 2008; Greibus et al., 2017), or hybrid HMM-neural network models (Gales et al., 2015; Lileikytė et al., 2018). Third, different evaluation methodologies were used. ...
Article
Conventional large vocabulary automatic speech recognition (ASR) systems require a mapping from words into sub-word units to generalize over the words that were absent in the training data and to enable the robust estimation of acoustic model parameters. This paper surveys the research done during the last 15 years on the topic of word to sub-word mappings for Lithuanian ASR systems. It also compares various phoneme and grapheme based mappings across a broad range of acoustic modelling techniques including monophone and triphone based Hidden Markov models (HMM), speaker adaptively trained HMMs, subspace Gaussian mixture models (SGMM), feed-forward time delay neural network (TDNN), and state-of-the-art low frame rate bidirectional long short term memory (LFR BLSTM) recurrent deep neural network. Experimental comparisons are based on a 50-hour speech corpus. This paper shows that the best phone-based mapping significantly outperforms a grapheme-based mapping. It also shows that the lowest phone error rate of an ASR system is achieved by the phoneme-based lexicon that explicitly models syllable stress and represents diphthongs as single phonetic units.
... Both mentioned authors assumed that a phoneme becomes two new phonemes over time through palatalization. Lithuanian phoneme sets of different size are given and tested by Greibus et al. (2017) in the context of speech recognition. The experiment results show that the Baseline phoneme set (set without palatalization and stress) outperformed other sets. ...
Article
The goal of this research is to find a set of acoustic parameters that are related to differences between Polish and Lithuanian language consonants. In order to identify these differences, an acoustic analysis is performed, and the phoneme sounds are described as the vectors of acoustic parameters. Parameters known from the speech domain as well as those from the music information retrieval area are employed. These parameters are time- and frequency-domain descriptors. English language as an auxiliary language is used in the experiments. In the first part of the experiments, an analysis of Lithuanian and Polish language samples is carried out, features are extracted, and the most discriminating ones are determined. In the second part of the experiments, automatic classification of Lithuanian/English, Polish/English, and Lithuanian/Polish phonemes is performed.
... Conveying words to the computer or issuing commands by voice lets users input all the words they need. In this way they can achieve the best result without looking at the keyboard, and the computer returns the recognized words as output, which helps them study in an appropriate manner (Greibus et al. 2017). ...
Article
Speech recognition is a subfield of computational linguistics and computer science. As an interdisciplinary field, it develops new technologies and methodologies for recognizing spoken words and translating them into text. In recent years the field has advanced considerably through deep learning applied to voice recognition. Evidence points to growing market demand for applications that use voice recognition for specific data. The deployment of speech recognition systems and their methods of analysis may help shape every individual's future. The computer plays an important role in this process, since it is what allows the translated words to be rendered as text.
... The Quaero program was used and 18.3% WER was obtained on the development data set. Greibus et al. [25] presented the influence of the phoneme set on the accuracy of Lithuanian speech command recognition. Acoustic models were trained on the LIEPA speech corpus, and a rule-based transformation was proposed for the Lithuanian language. ...
Article
Natural language human–machine interaction is a well-explored yet challenging research domain. The main objective is to obtain a system that can communicate with humans in a well-organized manner, regardless of the operational environment. In this paper a systematic survey on Automatic Speech Recognition (ASR) for tonal languages spoken around the globe is carried out. The tonal languages of the Asian, Indo-European and African continents are reviewed, but the American and Austral-Asian tonal languages are not. The most important part of this paper presents the work done in previous years on ASR for Asian tonal languages such as Chinese, Thai, Vietnamese, Mandarin, Mizo and Bodo, Indo-European tonal languages such as Punjabi, Lithuanian, Swedish and Croatian, and African tonal languages such as Yoruba and Hausa. Finally, a synthesis analysis based on the findings is presented, and many issues and challenges related to tonal languages are discussed. It is observed that a lot of work has been done for the Asian tonal languages Chinese, Thai, Vietnamese and Mandarin, but little work has been reported for Mizo and Bodo, for Indo-European tonal languages such as Punjabi, Latvian and Lithuanian, or for the African tonal languages Hausa and Yoruba.
... Part 1 was transformed into the format used by the specific speech recognition system. Changes were applied to the phoneme set and the composition of the pronunciation dictionary by specific software (Greibus et al., 2017). This example demonstrates that every speech corpus can be used according to the needs of speech researchers and processed in different ways. ...
Article
The problem of designing a speech corpus for human-computer interfaces working in voice recognition and synthesis mode is investigated. The specific requirements that speech recognizers and synthesizers place on a speech corpus are emphasized. It is argued that such a speech corpus has to consist of two parts: one part for the needs of Lithuanian text-to-speech synthesizers, the other for the needs of Lithuanian speech recognition engines. It has been determined that the part designed for speech recognition engines has to make it possible to represent language specificity through different sets of phonemes. According to the research results, the speech corpus Liepa, which consists of two parts, was developed. This speech corpus opens possibilities for cost-effective and flexible development of human-computer interfaces working in voice recognition and synthesis mode.
Article
The paper deals with the application of Microsoft Office Communications Server Speech Server (MSS'2007) to Lithuanian digit recognition. Voice servers integrate telephony, speech and the internet and provide tools for developing applications that run over a telephone. Using transcriptions of Lithuanian words is so far the only way to apply voice servers to the Lithuanian language. The results of an investigation of Lithuanian digit recognition by the German, English, French and Spanish speech recognition engines implemented on MSS'2007 are presented. The best accuracy of Lithuanian digit recognition was achieved by the Spanish recognizer. It could be increased using custom pronunciations of Lithuanian digits prepared using Universal Phone Set (UPS) labels. A demonstration of user identification by telephone using the Spanish recognizer was prepared.
Conference Paper
In this paper, the authors present the results of ongoing research on Large Vocabulary Automatic Speech Recognition for the Latvian language. The paper describes the initial acoustic model, phoneme set, filler and noise models, and grapheme-to-phoneme modelling. The second part of this work is focused on language modelling. Different word and class-based n-gram models are evaluated in terms of perplexity and word error rate in a speech recognition task. The authors also train a recurrent neural network language model and use it for n-best rescoring.
Conference Paper
Advances in speech processing research rely on the availability of public resources such as corpora, statistical models and baseline systems. In contrast to languages such as English, there are few specific resources for Brazilian Portuguese. This work describes efforts aiming to reduce this gap. Baseline acoustic models for Brazilian Portuguese were built using the CMU Sphinx toolkit and public domain resources: speech corpora, phonetic dictionary and language model. Experiments were carried out for dictation and grammar tasks, and the obtained results can be used to support further research. Part of the trained acoustic models and a reference speech corpus were made publicly available.
Article
The paper deals with the recognition of Lithuanian digits and sequences of Lithuanian digits by recognition engines of other languages. The preparation of telephony applications on Microsoft Office Communications Server 2007 Speech Server is examined. Results of Lithuanian digit recognition by English and Spanish recognition engines are presented.
Article
Urdu language processing applications frequently encounter non-Urdu text, specifically English text. The accuracy of these systems, e.g. machine translation and text-to-speech, is severely undermined because they are unable to handle English text. One possibility would be to add multilingual language processing capabilities to Urdu language processing applications so that they can handle English text along with Urdu, but this approach is quite taxing. Another approach is to transliterate English text into Urdu automatically and then pass it on to the Urdu language processing applications. This paper describes an English to Urdu transliteration system. First, the mapping rules used to generate Urdu text from English transcription are discussed; then the syllabification, manual transliteration and Urduization phases are described; and finally the issues related to Out-Of-Vocabulary (OOV) words are discussed.
Article
This paper presents a knowledge-based approach to grapheme-to-phoneme conversion (G2P) of isolated words of Lithuanian. Grapheme-to-phoneme conversion is performed in three consecutive steps: syllable boundary identification, accentuation, and transcription. Automatic accentuation is the most challenging task that is solved by combining lexicon with the accentuation rules formalized for every grammatical category. The algorithm is evaluated on the list of 50 000 word types which is obtained by selecting 100 most frequent word types per grammatical category. The proposed algorithm achieved 93.5% and 98.9% G2P accuracy at word and grapheme level respectively.
Article
This paper deals with one of the components of text-to-speech synthesis of the Lithuanian language, namely text transcription. The method of formal rules is used for text transcription. In this work the suitability of this method is justified, the appropriate form of the rules is analysed, and the set of rules and the interpreting algorithm are presented. Contextual information, features of stress, syllable boundaries and softness are used in the rules.
Article
The present work is concerned with speech recognition using a small or medium size vocabulary. The possibility to use the English speech recognizer for the recognition of Lithuanian was investigated. Two methods were used to deal with such problems: the expert-driven (knowledge-based) method and the data-driven one. Phonological systems of English and Lithuanian were compared on the basis of the knowledge of phonology, and relations between certain Lithuanian and English phonemes were established. Situations in which correspondences between the phonemes were to be established experimentally (i.e., using the data-driven method) and the English phonemes that best matched the Lithuanian sounds or their combinations (e.g., diphthongs) in such situations were identified. The results obtained were used for creating transcriptions of the Lithuanian names and surnames that were used in recognition experiments. The experiments without transcriptions, with a single transcription and with many transcriptions were carried out. The method that allowed finding a small number of best transcriptions was proposed. The recognition rate achieved was as follows: 84.2% with the vocabulary containing 500 word pairs.
Conference Paper
Several approaches have been adopted over the years for grapheme-to-phone conversion for European Portuguese: hand-derived rules, neural networks, classification and regression trees, etc. This paper describes different approaches implemented as weighted finite state transducers (WFST), motivated by their flexibility in integrating multiple sources of information and other interesting properties such as inversion. We describe and compare rule-based, data-driven and hybrid approaches. Best results were obtained with the rule-based approach, but one should take into account the fact that the data-driven one was trained with automatically transcribed material.
Article
This paper describes a speech-to-text system for semi-spontaneous Estonian speech. The system is trained on about 100 hours of manually transcribed speech and a 300M word text corpus. Compound words are split before building the language model and reconstructed from recognizer output using a hidden event N-gram model. We use a three pass transcription strategy with unsupervised speaker adaptation between individual passes. The system achieves a word error rate of 34.6% on conference speeches and 25.6% on radio talk shows.
Conference Paper
Computerized systems with voice user interfaces could save time and ease the work of healthcare practitioners. To achieve this goal, a voice user interface should be reliable (recognizing commands with sufficiently high accuracy) and properly designed (convenient for the user). The paper deals with implementation issues of a hybrid approach to voice command recognition. By the hybrid approach we mean the combination of several different recognition methods to achieve higher recognition accuracy. The experimental results show that most voice commands are recognized well enough, but there is a set of voice commands whose recognition is more difficult. In this paper a novel method based on the Ripper algorithm is proposed for combining several recognition methods. Experimental evaluation showed that this method achieves higher recognition accuracy than the application of a blind combination rule.
Article
The paper presents research results obtained when building a speaker-independent hybrid speech recognizer. This recognizer will be integrated as a phrase recognizer in a medical-pharmaceutical information system. The hybrid speech recognizer consists of two recognition components: an adapted commercial Microsoft Spanish speech recognizer and a locally developed hidden Markov model based recognizer implementing Lithuanian acoustic models. The efficiency of both recognition components was evaluated on multiple speaker-independent speech recognition tasks. The average accuracy of the Lithuanian recognizer was higher, reaching a 0.6% phrase error rate for user requests in the medical-pharmaceutical domain. The adapted commercial Spanish speech recognizer showed the ability to improve the accuracy of the Lithuanian recognizer in the worst recognition scenarios. These results proved the hypothesis formulated when proposing the basic idea of the hybrid recognition approach: recognition errors from different recognizers built using various techniques are not strongly correlated. This fact could be exploited to improve overall speech recognition accuracy.
Article
In this paper, the initial work on the development of a Lithuanian HMM speech recognition system is described. A triphone single-Gaussian HMM speech recognition system based on Mel Frequency Cepstral Coefficients (MFCC) was developed using the HTK toolkit. The parameters of the hidden Markov models were estimated from a phone-level hand-annotated Lithuanian speech corpus. The system was evaluated on a speaker-independent <750 distinct isolated-word recognition task. Although speaker adaptation and language modeling techniques were not used, the system performed at a 20% word error rate.
Conference Paper
Since recording technology has become more robust and easier to use, more and more universities are taking the opportunity to record their lectures and put them on the Web in order to make them accessible to students. Automatic speech recognition (ASR) techniques provide a valuable source for indexing and retrieval of lecture video materials. In this paper, we evaluate state-of-the-art speech recognition software to find a solution for the automatic transcription of German lecture videos. Our experimental results show that the word error rates (WERs) were reduced by 12.8% when the speech training corpus of a lecturer was increased by 1.6 hours.
Article
This paper presents design, development and contents of Lithuanian continuous speech corpus LRN 0.1 (Lithuanian Radio News, prototype-version 0.1). The corpus contains 17 hours 23 minutes of records from radio broadcast news read by 31 speakers. The recorded material is segmented into sentence-length records that are divided into training, development, and evaluation sets. Speech recordings are accompanied by word level transcriptions and automatically generated word-to-phone lexicon. The corpus is designed for the constructing and evaluating speaker-independent continuous speech recognition systems, and may also be used for linguistic research.
Article
Command and control (C&C) speech recognition allows users to interact with a system by speaking commands or asking questions restricted to a fixed grammar containing pre-defined phrases. Whereas C&C interaction has been commonplace in telephony and accessibility systems for many years, only recently have mobile devices had the memory and processing capacity to support client-side speech recognition. Given the personal nature of mobile devices, statistical models that can predict commands based in part on past user behavior hold promise for improving C&C recognition accuracy. For example, if a user calls a spouse at the end of every workday, the language model could be adapted to weight the spouse more than other contacts during that time. In this paper, we describe and assess statistical models learned from a large population of users for predicting the next user command of a commercial C&C application. We explain how these models were used for language modeling, and evaluate their performance in terms of task completion. The best performing model achieved a 26% relative reduction in error rate compared to the base system. Finally, we investigate the effects of personalization on performance at different learning rates via online updating of model parameters based on individual user data. Personalization significantly improved task completion and increased relative reduction in error rate by an additional 5%.
Article
A technique for transcribing Lithuanian text into phonemes for speech recognition is presented. Text-to-phoneme transformation is performed by formal rules and a dictionary. The formal rules were designed to set the relationship between segments of the text and units of formalized speech sounds (phonemes), and the dictionary to correct the transcription and to specify the stress mark and position. The proposed automatic transcription technique was tested by comparing its results with manually obtained ones. The experiment has shown that less than 6% of the transcribed words did not match.
Article
Thesis (M.S.E.)--University of Tulsa, 1998. Includes bibliographical references (leaves 37-38).
Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices
  • D Huggins-Daines
  • M Kumar
  • A Chan
  • A W Black
  • M Ravishankar
  • A I Rudnicky
SAMPA (Speech Assessment Methods Phonetic Alphabet) for encoding transcriptions of Lithuanian speech corpora
  • A Raškinis
  • G Raškinis
  • A Kazlauskienė