“Google” Lithuanian Speech Recognition Efficiency
Evaluation Research
Donatas Sipavičius and Rytis Maskeliūnas
Department of Multimedia Engineering, Kaunas University of Technology,
Studentu St. 50, 51368 Kaunas, Lithuania
donatas.sipavicius@ktu.edu
Abstract. This paper presents “Google” Lithuanian speech recognition efficiency evaluation research. For the experiment, a method consisting of three parts was chosen: (1) process all voice records without adding any noise; (2) process all voice records with several different types of noise, modified so as to obtain predefined signal-to-noise ratios (SNR); (3) after one month, reprocess all voice records without any additional noise and assess improvements in the quality of the speech recognition. The WER metric was chosen for speech recognition quality assessment. Analysis of the experimental results showed that SNR and speech type have the greatest impact on the quality of speech recognition (isolated words are recognized best, spontaneous speech worst). Meanwhile, characteristics such as the speaker's gender, smoothness of speech, speech speed and speech volume do not have any significant influence on speech recognition quality.
Keywords: “Google” Lithuanian speech recognition · Speech recognition · WER (Word Error Rate) · Signal-to-Noise Ratio (SNR)
1 Introduction
Speech recognition research is ongoing in Lithuania, aiming to adapt speech recognition systems to practical tasks. A number of studies have applied foreign-language recognizers to the Lithuanian speech recognition area. The authors of the Liepa project shared their view that the use of foreign-language-based recognizers for Lithuanian speech often limits the vocabulary and is economically ineffective [1]. The “Google” Lithuanian speech recognizer was released for public use in 2015. Google states that its speech recognition technology now has only an 8 % word error rate (WER) [2].
When using speech recognizers, one often faces the problem that some words are recognized incorrectly or not recognized at all. Speech recognition quality depends on the following factors:
The specifics of the language being recognized (e.g., Lithuanian word endings change to express grammatical relations within a sentence; a changed word ending gives the word another meaning);
Terminology specific to a particular application area;
Speaker dialect, speaking volume, articulation sharpness, various speech disorders;
Background noise level (volume) and its nature (factory, passing train, shopping center, etc.).
When choosing a speech recognizer, its characteristics must be taken into account, such as the speech type (whether isolated words, connected words, continuous speech or spontaneous speech must be recognized), the vocabulary size (the number of words that can be recognized), speaker dependence/independence (the number of recognizable speakers), and the communication channel and environment (resistance to communication channel distortion and background noise).
This paper presents “Google” Lithuanian speech recognition efficiency evaluation research. The experiment was carried out with 63299 voice records containing 3318 different phrases. All these voice records were processed by the “Google” speech recognizer at least 18 times (not counting the testing of the experimental software): once without any additional noise, 16 times with noise (4 different noise types, 4 SNRs), and once more without any additional noise after one month. The WER value and the number of voice recordings that were processed and received speech recognition results were assessed during the experiment.
The paper is structured as follows. A review of related work is presented in the second part of the paper. The experimental setup is described in the third part. The fourth part of the paper introduces the experimental data and the results of the experiment. The conclusions are presented in the fifth part of the paper.
2 Related Works
[3] deals with some aspects of developing voice user interfaces for several applications (digit names (0–9); commands for internet browser, text editor and media player control). The experimental investigation showed that 90 % recognition accuracy was achieved on average using an adapted foreign-language speech engine. Detailed analysis showed that most commands are recognized with very high accuracy.
[4] describes the development of the Lithuanian voice-controlled interface for a medical-pharmaceutical information system. The authors were able to achieve 95 % recognition accuracy, acceptable for practitioners. A phrase recognizer in a medical-pharmaceutical information system achieved a 14.5 % average error rate for names of diseases [5]. A proprietary hybrid approach [6] for Lithuanian medical terms achieves 99.1 % overall speech recognition accuracy. In [7] the best achieved recognition accuracy was 98.9 % for 1000 Lithuanian voice commands (diseases, complaints, drugs).
Another Lithuanian voice recognition system for medical-pharmaceutical terms is presented in [8]. Investigations showed that the Lithuanian speech recognizer achieves higher accuracy (over 96 % in a speaker-independent mode), but the use of the adapted foreign-language recognizer allows this baseline accuracy to be increased even further (over 98 % in a speaker-independent mode for 1000 voice commands) [8].
[9] deals with two elements of artificial intelligence methods—natural language processing and machine learning—applied to small-vocabulary Lithuanian language recognition. The average hybrid operation accuracy reached 99.24 % when the recognizer recognizes voice commands of 12 known speakers, and 99.18 % when it is applied to an unknown speaker.
The same researchers [10] evaluated the use of two recognizers for a Lithuanian medical disease digit code corpus and achieved around 96 % accuracy for clean speech. An approach similar to [9] was used in [11, 12]. In the first work the highest accuracy (98.16 %) was obtained when the k-nearest neighbors method was used with 15 nearest neighbors; in the second, 95 % accuracy was achieved by combining 4 different language recognizers.
[13] proposes a method for feature quality estimation that does not require recognition experiments and accelerates automatic speech recognition system development. The key component of the method is the use of metrics right after front-end feature computation. The experimental results show that the method is suitable for recognition systems with back-end Euclidean space classifiers.
3 Experimental Setup
“Google” Lithuanian speech recognition efficiency evaluation research method (Fig. 1)
consists of three parts:
1. First of all, process all voice records without adding any noise.
2. Process all voice records with several different types of noise, modified so as to obtain predefined signal-to-noise ratios (SNR).
3. “Google” announced its advancements in deep learning, a type of artificial intelligence, for speech recognition [2], so it is appropriate to reprocess all voice records without any additional noise after one month and to assess improvements in the quality of the speech recognition.
Fig. 1. “Google” speech recognition efficiency evaluation research method
Two methods are needed for processing audio records with “Google” speech recognition:
1. Method #1. One computer may send many audio records (depending on the thread count) to the “Google” speech recognizer at once. The audio format must be FLAC, so the records must be properly pre-processed. An audio adapter is unnecessary. We cannot rely only on this solution because the audio processing can be stopped at any time if “Google” disables the active key.
All records in FLAC format are selected from a user-defined folder. Then an unprocessed audio record is selected and information about it (noise type, SNR, sample rate) is read from a text file (the audio record and the text file share the same name). The “Google” speech recognizer processes the received audio record and returns the speech recognition result in JSON format. The received speech recognition result and the information about the noise record and SNR (if noise has been used) are saved in the DB. Then the speech recognition results are processed (decomposed into primary and alternative results, all numerals are transformed into text, WER is calculated). The processed audio record and text file are deleted from the file system. A sketch of this flow is given after the method descriptions.
2. Method #2. One computer may send only one audio record (the audio format does not matter) to the “Google” speech recognizer at a time. Audio record processing may be accelerated by connecting more computers. An audio adapter is required when processing audio records by this method (the experimental software plays the audio record, and the “Google” speech recognizer listens to the microphone).
When audio record processing runs, it automatically opens a web site in the “Google Chrome” browser and initiates the communication component, which in turn initiates the webkitSpeechRecognition component. Then the voice record and, if needed, the noise record and SNR are selected. The webkitSpeechRecognition component receives a message to listen to the microphone and to process the sound being played. The webkitSpeechRecognition component listens to the sound from the microphone until it gets a message that audio record playback is finished. If needed, the noise record is played at the required SNR while the audio record is playing (the noise record and the audio record are played at the same time). When the audio record is finished, the noise record stops playing. The webkitSpeechRecognition component stops listening to the microphone and waits for the speech recognition results. The “Google” speech recognizer processes the received audio record and returns the speech recognition result in JSON format. The received speech recognition result and the information about the noise record and SNR (if noise has been used) are saved in the DB. Then the speech recognition results are processed (decomposed into primary and alternative results, all numerals are transformed into text, WER is calculated). A browser-side sketch of this method is also given below.
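The paper gives no implementation details beyond the descriptions above. The following is a minimal TypeScript sketch of the Method #1 flow, assuming a hypothetical HTTP endpoint, API key and response schema (the actual “Google” service URL and parameters are not specified in the paper):

```typescript
import { readFile, unlink } from "fs/promises";

// Hypothetical endpoint and key: the real "Google" service URL, query parameters
// and response schema are not given in the paper and must be substituted.
const RECOGNIZER_URL = "https://example.com/speech-api/recognize"; // placeholder
const API_KEY = process.env.SPEECH_API_KEY ?? "";

interface RecordInfo {
  noiseType?: string; // noise type read from the accompanying text file
  snrDb?: number;     // SNR read from the accompanying text file
  sampleRate: number; // sample rate of the FLAC record
}

// Send one pre-converted FLAC record and return the parsed JSON result
// (primary and alternative hypotheses, per the paper).
async function recognizeFlac(flacPath: string, info: RecordInfo): Promise<unknown> {
  const audio = await readFile(flacPath);
  const response = await fetch(`${RECOGNIZER_URL}?lang=lt-LT&key=${API_KEY}`, {
    method: "POST",
    headers: { "Content-Type": `audio/x-flac; rate=${info.sampleRate}` },
    body: audio,
  });
  if (!response.ok) throw new Error(`Recognizer returned HTTP ${response.status}`);
  return response.json();
}

// One record of the batch: recognize, then (as in the paper) the result, noise type
// and SNR would be stored in the DB, WER computed, and the files deleted.
async function processRecord(flacPath: string, txtPath: string, info: RecordInfo) {
  const result = await recognizeFlac(flacPath, info);
  // ...save result, info.noiseType and info.snrDb to the DB, compute WER (Eq. (1))...
  await Promise.all([unlink(flacPath), unlink(txtPath)]);
  return result;
}
```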
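Correspondingly, a minimal browser-side sketch of Method #2, assuming hypothetical element ids for the audio players and omitting the SignalR messaging and DB storage used by the experimental software:

```typescript
// Runs inside "Google Chrome"; webkitSpeechRecognition is Chrome's built-in
// speech recognition API and is not typed in standard TypeScript, hence `any`.
const recognition = new (window as any).webkitSpeechRecognition();
recognition.lang = "lt-LT";          // recognize Lithuanian
recognition.interimResults = false;  // only final hypotheses
recognition.maxAlternatives = 5;     // keep primary and alternative results

// Hypothetical element ids; the experimental software plays these through the
// audio adapter while the recognizer listens to the microphone.
const voice = document.getElementById("voice") as HTMLAudioElement;
const noise = document.getElementById("noise") as HTMLAudioElement;

function playAndRecognize(noiseGain: number | null): void {
  recognition.onresult = (event: any) => {
    // Primary and alternative transcripts, analogous to the JSON results of Method #1;
    // the experimental software would forward these to the server and the DB.
    const alternatives = Array.from(event.results[0], (a: any) => a.transcript);
    console.log(alternatives);
  };
  recognition.start();          // start listening to the microphone
  if (noiseGain !== null) {
    noise.volume = noiseGain;   // noise level pre-computed for the target SNR
    noise.play();
  }
  voice.play();
  voice.onended = () => {       // playback finished: stop noise, stop recognizing
    noise.pause();
    recognition.stop();         // results arrive asynchronously via onresult
  };
}
```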
Necessary environment for realization. The experimental software consists of the following subsystems (Fig. 2):
Fig. 2. Decision context diagram
Data entry subsystem – implemented to import existing voice records into the DB, to record new voice records, and to describe voice records with additional information. The participant of the experiment and the executor of the experiment use this subsystem. Implemented with ASP.NET for the participant's use cases and “Windows Forms” for the executor's use cases.
Audio records processing subsystem – responsible for processing audio records. Only the executor of the experiment uses this subsystem. Detailed information is given below.
Results processing subsystem – processes the received speech recognition results and presents the aggregated information. Only the executor of the experiment uses this subsystem. Implemented with “Windows Forms”.
The audio records processing subsystem consists of:
1. Audio records processing (Method #1)
Audio record preparation for processing – this process converts all needed voice records to FLAC format and also combines all voice records with all possible noise types, modified so as to obtain the predefined SNRs.
Audio records processing – this process sends many audio records (along with the audio format and sample rate) to the “Google” speech recognizer at once.
2. Audio records processing (Method #2)
Audio records processing – this process takes a voice record from the DB, runs the “Google Chrome” browser in the automatic microphone use state and opens a web site with the communication component and the webkitSpeechRecognition component. Through the communication component it communicates with the webkitSpeechRecognition component and saves the speech recognition results. Implemented with “Windows Forms”.
Communication component – this component acts as an intermediary between the audio record process and the webkitSpeechRecognition component. Implemented with ASP.NET SignalR.
webkitSpeechRecognition – the “Google” speech recognition client-side component that works in the “Google Chrome” browser. It processes the audio record and returns the speech recognition results (text).
Metrics. Choosing appropriate metrics to track the quality of the system is critical to success [14]. The common metrics used to evaluate the quality of a recognizer are [14]:
Word Error Rate (WER) – measures misrecognitions at the word level: it compares the words output by the recognizer to those the user really spoke. Every error (substitution, insertion or deletion) is counted against the recognizer [14] (a word-level edit-distance sketch is given after this list of metrics):
$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\,\% \qquad (1)$$

where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words and $N$ is the number of words in the reference transcription.
Semantic Quality (WebScore) – tracks the semantic quality of the recognizer by measuring how many times the search result as queried by the recognition hypothesis varies from the search result as queried by a human transcription [14]. A better recognizer has a higher WebScore [14]. The WebScore gives a much clearer picture of what the user experiences when they search by voice [14]. The authors of [14] tend to focus on optimizing this metric rather than the more traditional WER metric defined above.
Perplexity (PPL) – a measure of the size of the set of words that can be recognized next, given the previously recognized words in the query [14]. This gives a rough measure of the quality of the language model – the lower the perplexity, the better the model is at predicting the next word [14] (the standard formula is given after this list).
Out-of-Vocabulary (OOV) Rate – tracks the percentage of words spoken by the user that are not modeled by the language model [14]. It is important to keep this number as low as possible, because any word spoken by users that is not in the vocabulary will ultimately result in a recognition error; furthermore, these recognition errors may also cause errors in surrounding words due to the subsequent poor predictions of the language model and acoustic misalignments [14].
Latency – defined as the total time (in seconds) it takes to complete a search request by voice (the time from when the user finishes speaking until the search results appear on the screen) [14]. Many factors contribute to latency as perceived by the user [14]: (a) the time it takes the system to detect end-of-speech, (b) the total time to recognize the spoken query, (c) the time to perform the web query, (d) the time to return the web search results back to the client over the network, and (e) the time it takes to render the search results in the browser of the user's phone.
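The paper does not show how WER is computed per phrase; a minimal sketch, assuming whitespace tokenization (which, per the conclusions, presumes numerals have already been normalized to text), computes Eq. (1) from the word-level edit distance:

```typescript
// Word Error Rate per Eq. (1): (substitutions + deletions + insertions) / reference words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // Levenshtein distance over words via dynamic programming.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost, // substitution or match
      );
    }
  }
  return (d[ref.length][hyp.length] / ref.length) * 100; // WER in percent
}

// Example from the results section: "penktas" recognized as "tanki test" gives 200 % (i.e. 2.0).
console.log(wordErrorRate("penktas", "tanki test"));
```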
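The perplexity formula is not given in the paper; the standard definition for a test word sequence $w_1 \dots w_N$ under a language model $P$ is:

```latex
\mathrm{PPL} \;=\; P(w_1,\dots,w_N)^{-1/N}
            \;=\; \exp\!\Big(-\tfrac{1}{N}\sum_{i=1}^{N}\ln P(w_i \mid w_1,\dots,w_{i-1})\Big)
```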
4 Experimental Research
4.1 Experimental Data
The experiment was carried out with 63299 voice records; the total duration of these records is 86.06 h and the average duration per record is 4.89 s. 359 speakers participated in the experiment: male – 111 (30.92 %), female – 248 (69.08 %).
Detailed information about voice records is presented in Tables 1 and 2.
Table 1. Information about voice records

Speech attribute      Yes, pcs.   Yes, %   No, pcs.   No, %
Smooth speech         61930       97.84    1369       2.16
Speech with accent    1640        2.59     61659      97.41
Fast speech           23773       37.56    39526      62.44
Loud speech           62869       99.32    430        0.68
Table 2. Voice records distribution by type of speech

Speech type           Pcs.    %
Isolated words        39349   62.16
Connected words       35      0.06
Continuous speech     23887   37.74
Spontaneous speech    28      0.04
All voice records were processed without noise and with 4 different types of noise (train station, traffic, car driving and white noise), modified so as to obtain predefined signal-to-noise ratios (SNR): 25 dB, 20 dB, 15 dB and 10 dB. A sketch of the noise scaling is given below.
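The mixing procedure itself is not described in the paper; a minimal sketch, assuming the speech and noise signals are available as raw sample arrays at the same sampling rate, scales the noise so that the speech-to-noise power ratio matches the target SNR:

```typescript
// Root-mean-square amplitude of a signal.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Scale the noise so that 20*log10(rms(speech) / rms(scaled noise)) = targetSnrDb,
// then mix it into the speech (the noise is looped if it is shorter than the speech).
function mixAtSnr(speech: Float32Array, noise: Float32Array, targetSnrDb: number): Float32Array {
  const gain = rms(speech) / (rms(noise) * Math.pow(10, targetSnrDb / 20));
  const mixed = new Float32Array(speech.length);
  for (let i = 0; i < speech.length; i++) {
    mixed[i] = speech[i] + gain * noise[i % noise.length];
  }
  return mixed;
}

// Usage with the four SNR levels of the experiment (assuming decoded sample arrays):
// [25, 20, 15, 10].forEach(snr => saveRecord(mixAtSnr(speechSamples, noiseSamples, snr)));
```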
4.2 The Results of the Experiment
Speech recognition results (without noise). After processing all 63299 voice records for the first time (without any noise), there was no speech recognition result at all for 20784 voice records. In this subsection statistics are presented for the 42515 voice records (238885 phrases) for which “Google” speech recognition returned results. The average WER for all 42515 voice records is 40.74 % (Table 3). This value is obtained by averaging the WER values of all voice records with speech recognition results. The WER standard deviation¹ is 37.70 ± 0.30 (the ± 0.30 value is the calculated confidence interval with a 90 % probability that the value is in this range; see the note below).
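The paper does not state how the ± margin is computed; the reported values are consistent with the usual normal-approximation 90 % confidence interval for a mean,

```latex
\Delta \;=\; z_{0.95}\,\frac{\sigma}{\sqrt{n}}
       \;\approx\; 1.645 \cdot \frac{37.70}{\sqrt{42515}}
       \;\approx\; 0.30 ,
```

assuming n = 42515 records and σ = 37.70.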
Table 3. Speech recognition results by speakers

                               WER, %
Average                        40.74
Best speaker                   10.00
Average of 3 best speakers     14.74
Worst speaker                  100.00
Average of 3 worst speakers    96.39
Speech recognition results by speaker's gender: female – 39.91 % (WER standard deviation – 38.54 ± 0.36), male – 42.96 % (WER standard deviation – 35.27 ± 0.54).
Speech recognition results by speech attribute are: not smooth speech – 38.71 % (WER standard deviation – 34.91 ± 1.75), smooth speech – 40.79 % (WER standard deviation – 37.77 ± 0.31); speech without accent – 40.56 % (WER standard deviation – 37.76 ± 0.30), speech with accent – 48.39 % (WER standard deviation – 34.25 ± 1.81); fast speech – 42.33 % (WER standard deviation – 39.06 ± 0.53), normal speech – 39.90 % (WER standard deviation – 36.95 ± 0.36); quiet speech – 43.59 % (WER standard deviation – 27.96 ± 3.49), loud speech – 40.73 % (WER standard deviation – 37.74 ± 0.30) (Table 4).
Table 4. Speech recognition results by speech type

Speech type          Number of phrases   Words of phrases   Average WER, %   WER standard deviation
Isolated words       22206               40555              31.55            43.29 ± 0.48
Connected words      12                  246                64.55            30.98 ± 14.71
Continuous speech    20297               198084             50.77            27.09 ± 0.31
Spontaneous speech   0                   –                  –                –
¹ The standard deviation is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the standard deviation is big; and vice versa.
It can be seen that speech type has a significant impact on the quality of speech recognition: the best speech recognition results are obtained with isolated words, connected words have the maximum average WER (only 12 recognized phrases out of 35 voice records), and the worst is spontaneous speech (none of the 28 voice records was recognized).
The worst recognized phrases are (average WER is 2.0): penktas (recognized as
“tanki test”); išsijunk (recognized as “iš trijų”); sviestas (recognized as “speed test”);
lizosoma (recognized as “visos stoma”); taikyk (recognized as “tai kiek”); dešinėn
(recognized as “dešimt min”); padidink (recognized as “lady zippy”); šunkelis (recog‐
nized as “šunų kelis”); išsijunk (recognized as “iš jų”); bjaurus (recognized as “į eurus”);
rnr (recognized as “prie neries”); aštuntas (recognized as “pašto kodas”).
Speech recognition results (with noise). This subsection presents the speech recognition results when all 63299 voice records were processed with noise (4 different noise types, 4 different SNRs).
Figure 3 shows how the quantity of recognized voice records depends on the noise type and SNR. It can be seen that most speech recognition results (not necessarily correct) are obtained with SNR at 25 dB (the exception is the “car driving” noise, for which most speech recognition results are obtained with SNR at 20 dB).
Fig. 3. Recognized voice records quantity dependence on the noise type and SNR
Figure 4 shows how WER depends on the noise type and SNR. It can be seen that the best speech recognition is achieved when SNR = 25 dB (that is, with the weakest noise).
“Google” speech recognition assessment of improvements. After processing all 63299 voice records again after one month (without noise), there was no speech recognition result for 20868 voice records (84 records more than one month earlier). The average WER for the 42431 recognized voice records is 40.82 %. It can be seen that the average WER is 0.08 percentage points worse than one month earlier.
5 Conclusions and Future Works
During the testing of the experimental software it was noticed that numerals in the speech recognition result can be returned as numbers, as text, or as numbers combined with text (e.g. “1940”, “tūkstantis devyni šimtai keturiasdešimtieji”, “1940-ieji”, “1940i”), even when the same phrase is processed multiple times. This causes some trouble when comparing the spoken text with the speech recognizer output (numbers have to be converted to text in all speech recognition results), although semantically the speech recognition results can be correct.
In the first stage of the experiment it was noticed that, out of the 63299 processed audio records, no results were returned for 20784 records, which is 32.83 % of all records. In the third stage of the experiment, when it was hoped that the speech recognizer had trained on the recordings processed in the previous stages and that the results would be better, no results were returned for 84 records more than in stage one.
The average WER value for all speech records that were processed by the “Google” speech recognizer and have results is 40.74 %, with a standard deviation of 37.70 %.
On the basis of the analysis of the experimental results we can state that speaker gender has little impact on speech recognition quality – the average WER difference between men and women is 3.05 %. Smoothness of speech, speech speed and speech volume also have little impact on speech recognition quality (the average WER difference between comparable groups varies from 1.49 to 2.86 %), but the result for accented speech is worse by 7.83 %. Speech type has a big impact on speech recognition: isolated words are recognized best and spontaneous speech worst.
After completing the speech recognition experiment with 4 different signal-to-noise ratios (SNR), it was noticed that the best speech recognition is achieved when the SNR is 25 dB (with the lowest noise level). Most speech recognition results (regardless of correctness) were returned when the SNR was 25 dB (exception – the “car driving” noise, which got the most results when the SNR was 20 dB).
Fig. 4. WER dependence on the noise type and SNR (y-axis: average WER, %; x-axis: SNR, dB; series: white noise, train station, car driving, traffic)
The experimental research results showed that quite often one of the alternative results is more correct than the primary result. This means that the speech recognition results could be better if the “Google” speech recognizer estimated more precisely which of the results should be final.
One month after the processing of the speech records without noise was completed, they were processed again, and the results showed that the “Google” speech recognizer's recognition quality had not improved: the speech recognition results (average WER) were 0.08 percentage points worse than the value obtained in the initial experiment. As the “Google” speech recognizer is free, we can assume that “Google” is aware of the recognizer's drawbacks and that the speech recognizer will be improved.
In future experiments, the speech records should be equal in number in every classified group to increase the confidence of the experimental results.
References
1. Telksnys, A.L., Navickas, G.: Žmonių ir kompiuterių sąveika šnekant. In: Kompiuterininkų
dienos - 2015, ISBN: 9789986343134, pp. 185–193. Žara. Vilnius (2015)
2. Google says its speech recognition technology now has only an 8 % word error rate. http://venturebeat.com/2015/05/28/google-says-its-speech-recognition-technology-now-has-only-an-8-word-error-rate/. Accessed 25 Apr 2016
3. Maskeliunas, R., Ratkevicius, K., Rudzionis, V.: Some aspects of voice user interfaces
development for internet and computer control applications. Elektronika ir elektrotechnika
19(2), 53–56 (2013). ISSN 1392-1215
4. Rudzionis, V., Ratkevicius, K., Rudzionis, A., Maskeliunas, R., Raskinis, G.: Voice
controlled interface for the medical-pharmaceutical information system. In: Skersys, T.,
Butleris, R., Butkiene, R. (eds.) ICIST 2012. CCIS, vol. 319, pp. 288–296. Springer,
Heidelberg (2012). ISBN: 9783642333071
5. Rudzionis, V., Raskinis, G., Maskeliunas, R., Rudzionis, A., Ratkevicius, K.: Comparative
analysis of adapted foreign language and native lithuanian speech recognizers for voice user
interface. Elektronika ir elektrotechnika 19(7), 90–93 (2013). ISSN 1392-1215
6. Rudžionis, V., Ratkevičius, K., Rudžionis, A., Raškinis, G., Maskeliunas, R.: Recognition of
voice commands using hybrid approach. In: Skersys, T., Butleris, R., Butkiene, R. (eds.)
ICIST 2013. CCIS, vol. 403, pp. 249–260. Springer, Heidelberg (2013)
7. Rudzionis, V., Raskinis, G., Maskeliunas, R., Rudzionis, A., Ratkevicius, K., Bartisiute, G.:
Web services based hybrid recognizer of lithuanian voice commands. Elektronika ir
elektrotechnika 20(9), 50–53 (2014). ISSN 1392-1215
8. Rudžionis, V., Raškinis, G., Ratkevičius, K., Rudžionis, A., Bartišiūtė, G.: Medical–pharmaceutical information system with recognition of Lithuanian voice commands. In: Human Language Technologies – The Baltic Perspective: Proceedings of the 6th International Conference, ISBN: 978161499441, pp. 40–45. IOS Press, Amsterdam (2014)
9. Bartišiūtė, G., Ratkevičius, K., Paškauskaitė, G.: Hybrid recognition technology for isolated
voice commands. In: Information Systems Architecture and Technology: Proceedings of 36th
International Conference on Information Systems Architecture and Technology – ISAT 2015
– Part IV, ISBN 978-3-319-28565-8, pp. 207–216 (2016)
10. Bartišiūtė, G., Paškauskaitė, G., Ratkevičius, K.: Investigation of disease codes recognition
accuracy. In: Proceedings of the 9th International Conference on Electrical and Control
Technologies, ECT 2014, pp. 60–63 (2014)
11. Rasymas, T., Rudžionis, V.: Evaluation of methods to combine different speech recognizers.
In: Computer Science and Information Systems (FedCSIS), pp. 1043–1047 (2015)
12. Rasymas, T., Rudžionis, V.: Lithuanian digits recognition by using hybrid approach by
combining lithuanian google recognizer and some foreign language recognizers. In:
Information and Software Technologies, ISBN 978-3-319-24769-4, pp 449–459 (2015)
13. Lileikytė, R., Telksnys, A.L.: Metrics based quality estimation of speech recognition features.
Informatica Vilnius, Matematikos ir informatikos institutas 24(3), 435–446 (2013). ISSN:
0868-4952
14. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Garret, M.,
Strope, B.: Google Search by Voice: A case study