“Google” Lithuanian Speech Recognition Efficiency
Evaluation Research
Donatas Sipavičius(✉) and Rytis Maskeliūnas
Department of Multimedia Engineering, Kaunas University of Technology,
Studentu St. 50, 51368 Kaunas, Lithuania
donatas.sipavicius@ktu.edu
Abstract. This paper presents “Google” Lithuanian speech recognition efficiency evaluation research. For the experiment a method consisting of three parts was chosen: (1) process all voice records without adding any noise; (2) process all voice records with several different types of noise, modified so as to obtain predefined signal-to-noise ratios (SNR); (3) after one month, reprocess all voice records without any additional noise and assess improvements in the quality of the speech recognition. The WER metric was chosen for speech recognition quality assessment. Analysis of the experiment results shows that the greatest impact on the quality of speech recognition comes from the SNR and the speech type (isolated words are recognized best, spontaneous speech worst). Meanwhile, characteristics such as the gender of the speaker, smoothness of speech, speech speed and speech volume do not have any significant influence on speech recognition quality.
Keywords: “Google” Lithuanian speech recognition · Speech recognition · WER (Word Error Rate) · Signal-to-Noise Ratio (SNR)
1 Introduction
Speech recognition research continues in Lithuania, trying to adapt speech recognition systems to practical challenges. There are a number of studies on applying foreign language recognizers to the Lithuanian speech recognition area. The authors of the project Liepa shared their view that the usage of foreign language based recognizers for Lithuanian speech is often limited in vocabulary and economically ineffective [1]. The “Google” Lithuanian speech recognizer was released for public use in 2015. Google says its speech recognition technology now has only an 8 % word error rate (WER) [2].
When using speech recognizers one often faces the problem that some words are recognized incorrectly or not recognized at all. Speech recognition quality depends on the following factors:
• Specificity of the recognized language (e.g., Lithuanian words change their endings to express grammatical relations in a sentence; a changed word ending gives another meaning);
• Terminology specific to a particular application area;
• Speaker dialect, speaking volume, sharpness, various speech disorders;
• Background noise level (volume) and its nature (factory, rolling train, shopping center, etc.).
When choosing a speech recognizer, its characteristics must be taken into account, such as the speech type (whether isolated words, connected words, continuous speech or spontaneous speech must be recognized), the size of the vocabulary (the number of words that can be recognized), speaker dependence/independence (the number of recognizable speakers), and the communication channels and environment (resistance to communication channel distortion and background noise).
This paper presents “Google” Lithuanian speech recognition efficiency evaluation research. The experiment was carried out with 63299 voice records covering 3318 different phrases. All these voice records were processed by the “Google” speech recognizer at least 18 times (not counting testing of the experimental software): once without any additional noise, 16 times with noise (4 different noise types, 4 SNRs), and once more without any additional noise after one month. The WER value and the number of voice recordings that were processed and returned speech recognition results were assessed during the experiment.
The paper is structured as follows. A review of related work is presented in the second part of the paper. The experimental setup is described in the third part. The fourth part of the paper introduces the experimental data and the results of the experiment. The conclusions are presented in the fifth part of the paper.
2 Related Works
Reference [3] deals with some aspects of developing voice user interfaces for several applications (digit names (0-9); commands for internet browser, text editor and media player control). The experimental investigation showed that 90 % recognition accuracy was achieved on average using adaptation of a foreign language speech engine. Detailed analysis showed that most commands are recognized with very high accuracy.
Reference [4] describes the development of a Lithuanian voice controlled interface for a medical-pharmaceutical information system. The authors were able to achieve 95 % recognition accuracy, acceptable for practitioners. A phrase recognizer in a medical-pharmaceutical information system achieved a 14.5 % average error rate for names of diseases [5]. A proprietary hybrid approach [6] for Lithuanian medical terms allows achieving 99.1 % overall speech recognition accuracy. In [7] the best achieved recognition was 98.9 % for 1000 Lithuanian voice commands (diseases, complaints, drugs).
Another Lithuanian voice recognition system for medical-pharmaceutical terms is presented in [8]. Investigations showed that the Lithuanian speech recognizer achieves high accuracy (over 96 % in a speaker independent mode), but the use of an adapted foreign language recognizer allows increasing this baseline accuracy even further (over 98 % in a speaker independent mode for 1000 voice commands) [8].
Reference [9] applies two elements of artificial intelligence methods, natural language processing and machine learning, to small vocabulary Lithuanian language recognition. The average accuracy of the hybrid system was 99.24 %
when the recognizer recognized voice commands from 12 known speakers, and 99.18 % when it was applied to an unknown speaker.
The same researchers [10] evaluated the use of two recognizers for a Lithuanian medical disease digit code corpus and achieved around 96 % accuracy for clean speech. An approach similar to [9] was used in [11, 12]. In the first work the highest accuracy (98.16 %) was obtained when the k-nearest neighbors method was used with 15 nearest neighbors; in the second, 95 % accuracy was achieved by combining 4 different language recognizers.
Reference [13] proposes a method for feature quality estimation that does not require recognition experiments and accelerates automatic speech recognition system development. The key component of the method is the use of metrics right after front-end feature computation. The experimental results show that the method is suitable for recognition systems with back-end Euclidean space classifiers.
3 Experimental Setup
The “Google” Lithuanian speech recognition efficiency evaluation research method (Fig. 1) consists of three parts:
1. First, process all voice records without adding any noise.
2. Process all voice records with several different types of noise, modified so as to obtain predefined signal-to-noise ratios (SNR).
3. “Google” announced its advancements in deep learning, a type of artificial intelligence, for speech recognition [2], so it is appropriate to reprocess all voice records without any additional noise after one month and assess improvements in the quality of the speech recognition.
Fig. 1. “Google” speech recognition efficiency evaluation research method
Two methods are needed for processing audio records with “Google” speech recognition:
1. Method #1. One computer may send many audio records (the number depends on the thread count) to the “Google” speech recognizer at once. The audio format must be FLAC, so the records
must be properly pre-processed. An audio adapter is unnecessary. We cannot rely only on this solution, because the audio processing can be stopped at any time if “Google” disables the active key.
All FLAC records are selected from a user-defined folder. An unprocessed audio record is then selected and information about it (noise type, SNR, sample rate) is read from a text file (the audio record and the text file share the same name). The “Google” speech recognizer processes the received audio record and returns the speech recognition result in JSON format. The received speech recognition result and information about the noise record and SNR (if noise has been used) are saved in the DB. Then the speech recognition results are processed (decomposed into primary and alternative results, all numerals are transformed into text, WER is calculated). The processed audio record and text file are deleted from the file system (a sketch of this batch flow is given after this list).
2. Method #2. One computer may send only one audio record (the audio format does not matter) to the “Google” speech recognizer at a time. Audio record processing may be accelerated by connecting more computers. An audio adapter is required when processing audio records with this method (the experimental software plays the audio record and the “Google” speech recognizer listens to the microphone).
When audio record processing runs, it automatically opens a web site in the “Google Chrome” browser and initiates the communication component, which initiates the webkitSpeechRecognition component. Then a voice record and, if needed, a noise record and SNR are selected. The webkitSpeechRecognition component gets a message to listen to the microphone and process the sound being played. It keeps listening to the sound from the microphone until it gets a message that audio record playback has finished. If needed, the noise record is played at the necessary SNR while the audio record is playing (the noise record and the audio record are played at the same time). When the audio record finishes, the noise record stops playing. The webkitSpeechRecognition component then stops listening to the microphone and waits for the speech recognition results. The “Google” speech recognizer processes the received audio and returns the speech recognition result in JSON format. The received speech recognition result and information about the noise record and SNR (if noise has been used) are saved in the DB. Then the speech recognition results are processed (decomposed into primary and alternative results, all numerals are transformed into text, WER is calculated).
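For illustration, a minimal sketch of the Method #1 batch flow is given below. It is not the experimental software itself: the recognizeFlac() recognizer call, the saveResult() database call and the JSON layout of the metadata text file are hypothetical placeholders standing in for the real “Google” request and the real storage code.

```typescript
// Minimal sketch of the Method #1 batch flow. recognizeFlac() and saveResult() are
// hypothetical placeholders for the real "Google" request and the real DB write.
import { promises as fs } from "fs";
import * as path from "path";

interface RecordMeta { noiseType?: string; snrDb?: number; sampleRate: number; }
interface RecognitionResult { transcript: string; alternatives: string[]; }

// Placeholder: the experimental software posted the FLAC bytes to the "Google"
// recognizer and parsed the JSON reply; here only the shape of the call is shown.
async function recognizeFlac(flac: Buffer, meta: RecordMeta): Promise<RecognitionResult> {
  return { transcript: "", alternatives: [] }; // stub result
}

// Placeholder: store the result together with noise type and SNR in the DB.
async function saveResult(file: string, meta: RecordMeta, r: RecognitionResult): Promise<void> {
  console.log(file, meta.noiseType ?? "clean", meta.snrDb ?? "-", r.transcript);
}

async function processFolder(folder: string, threads = 4): Promise<void> {
  // All FLAC records in the user-defined folder are queued for processing.
  const queue = (await fs.readdir(folder)).filter(f => f.endsWith(".flac"));

  const worker = async (): Promise<void> => {
    for (let file = queue.shift(); file !== undefined; file = queue.shift()) {
      const flacPath = path.join(folder, file);
      const metaPath = flacPath.replace(/\.flac$/, ".txt"); // same name, assumed JSON content
      const meta: RecordMeta = JSON.parse(await fs.readFile(metaPath, "utf8"));
      const result = await recognizeFlac(await fs.readFile(flacPath), meta);
      await saveResult(file, meta, result); // primary and alternative results go to the DB
      await fs.unlink(flacPath);            // processed files are removed
      await fs.unlink(metaPath);
    }
  };
  // Several records are in flight at once, mirroring the thread-count limit above.
  await Promise.all(Array.from({ length: threads }, worker));
}
```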
Necessary environment for realization. The experimental software consists of the following subsystems (Fig. 2):
Fig. 2. Decision context diagram
• Data entry subsystem – implemented to import existing voice records into the DB, to record new voice records, and to describe voice records with additional information. Both the participant of the experiment and the executor of the experiment use this subsystem. Implemented with ASP.NET for the participant's use cases and “Windows Forms” for the executor's use cases.
• Audio records processing subsystem – responsible for processing audio records. Only the executor of the experiment uses this subsystem. Detailed information is given below.
• Results processing subsystem – processes the received speech recognition results and presents the aggregated information. Only the executor of the experiment uses this subsystem. Implemented with “Windows Forms”.
The audio records processing subsystem consists of:
1. Audio records processing (Method #1)
• Audio records preparation for processing – this process converts all needed voice records to FLAC format and combines all voice records with all possible noise types, modified so as to obtain the predefined SNRs.
• Audio records processing – this process sends many audio records (along with the audio format and sample rate) to the “Google” speech recognizer at once.
2. Audio records processing (Method #2)
• Audio records processing – this process takes a voice record from the DB, runs the “Google Chrome” browser with automatic microphone use enabled and opens a web site with the communication component and the webkitSpeechRecognition component. Through the communication component it communicates with the webkitSpeechRecognition component and saves the speech recognition results. Implemented with “Windows Forms”.
• Communication component – this component acts as an intermediary between the audio record process and the webkitSpeechRecognition component. Implemented with ASP.NET SignalR.
• webkitSpeechRecognition – the “Google” speech recognition client-side component that works in the “Google Chrome” browser. It processes the audio record and returns the speech recognition results (text); a minimal usage sketch follows this list.
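The sketch below shows how the webkitSpeechRecognition component can be driven from the browser side. It relies on the non-standard webkitSpeechRecognition API available in “Google Chrome”; the lt-LT language code, the chosen property values and the callback wiring are illustrative assumptions rather than the exact code of the experimental software.

```typescript
// Minimal client-side sketch: capture one utterance from the microphone with
// Chrome's webkitSpeechRecognition and hand all hypotheses to a callback.
declare const webkitSpeechRecognition: any; // non-standard Chrome API, no typings assumed

function recognizeOnce(onFinal: (hypotheses: string[]) => void): { stop: () => void } {
  const rec = new webkitSpeechRecognition();
  rec.lang = "lt-LT";          // Lithuanian (assumed language code)
  rec.continuous = true;       // keep listening until stop() is called from the playback side
  rec.interimResults = false;  // only final results are of interest here
  rec.maxAlternatives = 5;     // also collect alternative hypotheses

  rec.onresult = (event: any) => {
    const last = event.results[event.results.length - 1];
    const hypotheses: string[] = [];
    for (let i = 0; i < last.length; i++) {
      hypotheses.push(last[i].transcript); // primary result first, then alternatives
    }
    onFinal(hypotheses);
  };
  rec.onerror = (e: any) => console.error("recognition error:", e.error);

  rec.start();                          // start listening to the microphone
  return { stop: () => rec.stop() };    // called when audio playback has finished
}

// Usage: start recognition, play the record (plus noise, if any), then stop.
const session = recognizeOnce(h => console.log("hypotheses:", h));
// ... play the voice record and, if needed, the noise record at the required SNR ...
// session.stop();
```

In this sketch the stop() call would correspond to the message that audio record playback has finished, after which the component waits for the recognition results.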
Metrics. Choosing appropriate metrics to track the quality of the system is critical to success [14]. The common metrics used to evaluate the quality of a recognizer are [14]:
• Word Error Rate (WER) – measures misrecognitions at the word level: it compares the words output by the recognizer to those the user really spoke. Every error (substitution, insertion or deletion) is counted against the recognizer [14]:

$$ \mathrm{WER} = \frac{S + D + I}{N} \times 100\,\% \qquad (1) $$

where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words the user really spoke (a word-level implementation sketch is given after this list of metrics).
• Semantic Quality (WebScore) – tracks the semantic quality of the recognizer by measuring how many times the search result queried with the recognition hypothesis differs from the search result queried with a human transcription
[14]. A better recognizer has a higher WebScore [14]. The WebScore gives a much clearer picture of what the user experiences when searching by voice [14], and the authors of [14] tend to focus on optimizing this metric rather than the more traditional WER metric defined above.
•Perplexity (PPL) – a measure of the size of the set of words that can be recognized
next, given the previously recognized words in the query [14]. This gives a rough
measure of the quality of the language model – the lower the perplexity, the better
the model is at predicting the next word [14].
• Out-of-Vocabulary (OOV) Rate – tracks the percentage of words spoken by the user that are not modeled by the language model [14]. It is important to keep this number as low as possible, because any word spoken by users that is not in the vocabulary will ultimately result in a recognition error; furthermore, these recognition errors may also cause errors in surrounding words due to the subsequent poor predictions of the language model and acoustic misalignments [14].
• Latency – is defined as the total time (in seconds) it takes to complete a search request by voice (the time from when the user finishes speaking until the search results appear on the screen) [14]. Many factors contribute to latency as perceived by the user [14]: (a) the time it takes the system to detect end-of-speech, (b) the total time to recognize the spoken query, (c) the time to perform the web query, (d) the time to return the web search results back to the client over the network, and (e) the time it takes to render the search results in the browser of the user's phone.
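To make (1) concrete, the sketch below computes the word-level edit distance between a reference transcription and a recognizer hypothesis. It is a generic textbook implementation rather than the scoring code used in this experiment, and the whitespace tokenization is an assumption.

```typescript
// Word Error Rate: (S + D + I) / N, computed with dynamic programming over words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(w => w.length > 0);
  const hyp = hypothesis.trim().split(/\s+/).filter(w => w.length > 0);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // d[i][j] = minimum number of edits to turn the first i reference words
  // into the first j hypothesis words.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;   // substitution cost
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // match or substitution
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length; // fraction; multiply by 100 for %
}

// Example from the paper's data: "penktas" recognized as "tanki test"
// gives (1 substitution + 1 insertion) / 1 reference word = WER of 2.0.
console.log(wordErrorRate("penktas", "tanki test")); // 2
```

Note that WER can exceed 100 % when the hypothesis contains more errors than the reference has words, which is consistent with the worst recognized phrases reported in Sect. 4 (average WER of 2.0).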
4 Experimental Research
4.1 Experimental Data
The experiment was carried out with 63299 voice records; their total duration is 86.06 h and the average duration per record is 4.89 s. 359 speakers participated in the experiment: 111 male (30.92 %) and 248 female (69.08 %).
Detailed information about voice records is presented in Tables 1 and 2.
Table 1. Information about voice records

Speech attribute       Yes (pcs.)   Yes (%)   No (pcs.)   No (%)
Smooth speech          61930        97.84     1369        2.16
Speech with accent     1640         2.59      61659       97.41
Fast speech            23773        37.56     39526       62.44
Loud speech            62869        99.32     430         0.68
Table 2. Voice records distribution by type of speech

Speech type            Pcs.    %
Isolated words         39349   62.16
Connected words        35      0.06
Continuous speech      23887   37.74
Spontaneous speech     28      0.04
All voice records were processed without noise and with 4 different types of noise (train station, traffic, car driving and white noise), modified so as to obtain predefined signal-to-noise ratios (SNR): 25 dB, 20 dB, 15 dB and 10 dB (a sketch of the mixing step is given below).
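The mixing step can be sketched as follows: the noise samples are scaled so that the ratio of speech power to scaled noise power matches the target SNR in dB, and the scaled noise is then added to the speech. This is a generic illustration assuming both signals are available as PCM sample arrays at the same sample rate; it is not the exact preparation code of the experimental software.

```typescript
// Scale a noise signal so that mixing it with `speech` yields the requested SNR (in dB),
// where SNR = 10 * log10(P_speech / P_noise) and P is the mean squared sample value.
function mixAtSnr(speech: Float32Array, noise: Float32Array, snrDb: number): Float32Array {
  const power = (x: Float32Array) =>
    x.reduce((sum, v) => sum + v * v, 0) / x.length;

  const pSpeech = power(speech);
  const pNoise = power(noise);
  // Gain applied to the noise so that pSpeech / (gain^2 * pNoise) = 10^(snrDb / 10).
  const gain = Math.sqrt(pSpeech / (pNoise * Math.pow(10, snrDb / 10)));

  const mixed = new Float32Array(speech.length);
  for (let i = 0; i < speech.length; i++) {
    // Loop the noise if it is shorter than the speech record.
    mixed[i] = speech[i] + gain * noise[i % noise.length];
  }
  return mixed;
}

// Example: mix one record with a noise type at the four SNRs used in the experiment.
// for (const snr of [25, 20, 15, 10]) { const noisy = mixAtSnr(speechSamples, noiseSamples, snr); }
```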
4.2 The Results of the Experiment
Speech recognition results (without noise). After processing all 63299 voice records for the first time (without any noise), 20784 voice records returned no speech recognition result at all. In this subsection statistics are presented for the 42515 voice records (238885 words) for which “Google” speech recognition returned results. The average WER for all 42515 voice records is 40.74 % (Table 3). This value is obtained by averaging the WER values of all voice records with speech recognition results. The WER standard deviation¹ is 37.70 ± 0.30 (the ±0.30 value is the calculated confidence interval with a 90 % probability that the value lies in this range).
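As a quick consistency check (assuming a normal approximation of the mean, which the paper does not state explicitly), the reported half-width follows from the sample size and the standard deviation:

\[
\Delta = z_{0.95}\,\frac{\sigma}{\sqrt{n}} = 1.645 \times \frac{37.70}{\sqrt{42515}} \approx 0.30 .
\]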
Table 3. Speech recognition results by speakers

          Average   Best    Average of 3 best speakers   Worst    Average of 3 worst speakers
WER, %    40.74     10.00   14.74                        100.00   96.39
Speech recognition results by speaker's gender: female – 39.91 % (WER standard deviation 38.54 ± 0.36), male – 42.96 % (WER standard deviation 35.27 ± 0.54). Speech recognition results by speech attribute are: not smooth speech – 38.71 % (WER standard deviation 34.91 ± 1.75), smooth speech – 40.79 % (WER standard deviation 37.77 ± 0.31); speech without accent – 40.56 % (WER standard deviation 37.76 ± 0.30), speech with accent – 48.39 % (WER standard deviation 34.25 ± 1.81); fast speech – 42.33 % (WER standard deviation 39.06 ± 0.53), normal speech – 39.90 % (WER standard deviation 36.95 ± 0.36); quiet speech – 43.59 % (WER standard deviation 27.96 ± 3.49), loud speech – 40.73 % (WER standard deviation 37.74 ± 0.30) (Table 4).
Table 4. Speech recognition results by speech type

Speech type          Number of phrases   Words of phrases   Average WER, %   WER standard deviation
Isolated words       22206               40555              31.55            43.29 ± 0.48
Connected words      12                  246                64.55            30.98 ± 14.71
Continuous speech    20297               198084             50.77            27.09 ± 0.31
Spontaneous speech   0                   –                  –                –
¹ The standard deviation is a numerical value used to indicate how widely individuals in a group vary. If individual observations vary greatly from the group mean, the standard deviation is large, and vice versa.
It can be seen that speech type has a significant impact on the quality of speech recognition: the best results are obtained for isolated words, connected words have the highest average WER (only 12 out of 35 voice records returned results), and the worst case is spontaneous speech (none of its 28 voice records was recognized).
The worst recognized phrases (average WER of 2.0) are: penktas (recognized as “tanki test”); išsijunk (recognized as “iš trijų”); sviestas (recognized as “speed test”); lizosoma (recognized as “visos stoma”); taikyk (recognized as “tai kiek”); dešinėn (recognized as “dešimt min”); padidink (recognized as “lady zippy”); šunkelis (recognized as “šunų kelis”); išsijunk (recognized as “iš jų”); bjaurus (recognized as “į eurus”); rnr (recognized as “prie neries”); aštuntas (recognized as “pašto kodas”).
Speech recognition results (with noise). This subsection presents speech recognition results when all 63299 voice records were processed with noise (4 different noise types, 4 different SNRs).
Figure 3 shows how the number of recognized voice records depends on the noise type and SNR. It can be seen that most speech recognition results (not necessarily correct) are obtained at an SNR of 25 dB (exception – the “Car driving” noise, for which most speech recognition results are obtained at an SNR of 20 dB).
Fig. 3. Recognized voice records quantity dependence on the noise type and SNR
Figure 4 shows how WER depends on the noise type and SNR. It can be seen that speech recognition is best when SNR = 25 dB (that is, with the weakest noise).
“Google” speech recognition assessment of improvements. After processing all 63299 voice records again after one month (without noise), 20868 voice records returned no speech recognition result (84 records more than one month earlier). The average WER for the 42431 recognized voice records is 40.82 %, i.e. 0.08 percentage points worse than one month earlier.
5 Conclusions and Future Works
During testing of the experimental software it was noticed that numerals in the speech recognition result can be returned as numbers, as text, or as numbers combined with text (e.g. “1940”, “tūkstantis devyni šimtai keturiasdešimtieji”, “1940-ieji”, “1940i”), even when the same phrase is processed multiple times. This causes trouble when comparing the spoken text with the result of the speech recognizer (numbers have to be replaced with text in all speech recognition results), although semantically the speech recognition results can be correct (a minimal normalization sketch is given below).
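A minimal sketch of such a normalization step is shown below. The digit-to-word map covers only single cardinal digits; handling multi-digit numbers such as “1940” and their inflected forms would require a full Lithuanian number-to-words conversion, so this is purely illustrative and not the mapping used by the experimental software.

```typescript
// Replace standalone digits with their Lithuanian word form so that spoken text
// and recognizer output can be compared word by word (illustrative subset only).
const DIGIT_WORDS: Record<string, string> = {
  "0": "nulis", "1": "vienas", "2": "du", "3": "trys", "4": "keturi",
  "5": "penki", "6": "šeši", "7": "septyni", "8": "aštuoni", "9": "devyni",
};

function normalizeNumerals(text: string): string {
  // Only single-digit tokens are handled here; multi-digit numbers such as "1940"
  // would need a full Lithuanian number-to-words conversion (not shown).
  return text
    .split(/\s+/)
    .map(tok => DIGIT_WORDS[tok] ?? tok)
    .join(" ");
}

console.log(normalizeNumerals("5 9")); // "penki devyni"
```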
In the first stage of the experiment it was noticed that, of the 63299 processed audio records, no results were returned for 20784 records, i.e. 32.83 % of all records. In the third stage of the experiment, when it was hoped that the speech recognizer had been trained by processing the recordings from the previous stages and the results would be better, no results were returned for 84 records more than in stage one.
The average WER value for all speech records that were processed by the “Google” speech recognizer and returned results is 40.74 %, with a standard deviation of 37.70 %.
On the basis of the analysis of the experiment results we can state that speaker gender has little impact on speech recognition quality – the average WER difference between men and women is 3.05 %. Smoothness of speech, speech speed and speech volume also have little impact on speech recognition quality (the average WER difference between comparable groups varies from 1.49 to 2.86 %), but speech with an accent gives a result that is worse by 7.83 %. Speech type has a big impact on speech recognition: isolated words are recognized best and spontaneous speech worst.
After completing the speech recognition experiment with 4 different signal-to-noise ratios (SNR) it was noticed that speech recognition is best when the SNR is 25 dB (the lowest noise level). Most speech recognition results (regardless of correctness) were returned when the SNR was 25 dB (exception – the “car driving” noise, for which most results were returned when the SNR was 20 dB).
Fig. 4. WER dependence on the noise type and SNR (average WER, %, plotted against SNR in dB for white noise, train station, car driving and traffic noise)
The experimental research results showed that quite often one of the alternative results is more correct than the primary result. This means that speech recognition results could be better if the “Google” speech recognizer estimated more precisely which of the results should be final.
A month after the noise-free processing was completed, the speech records were processed again, and the results showed that the recognition quality of the “Google” speech recognizer had not improved: the average WER was 0.08 percentage points worse than the value obtained in the initial experiment. Since the “Google” speech recognizer is free, we can assume that “Google” is aware of the recognizer's drawbacks and that it will be improved.
In future work, the experimental speech records should be of the same size in every classified group to increase the confidence of the experiment results.
References
1. Telksnys, A.L., Navickas, G.: Žmonių ir kompiuterių sąveika šnekant. In: Kompiuterininkų
dienos - 2015, ISBN: 9789986343134, pp. 185–193. Žara. Vilnius (2015)
2. Google says its speech recognition technology now has only an 8 % word error rate. http://venturebeat.com/2015/05/28/google-says-its-speech-recognition-technology-now-has-only-an-8-word-error-rate/. Accessed 25 Apr 2016
3. Maskeliunas, R., Ratkevicius, K., Rudzionis, V.: Some aspects of voice user interfaces
development for internet and computer control applications. Elektronika ir elektrotechnika
19(2), 53–56 (2013). ISSN 1392-1215
4. Rudzionis, V., Ratkevicius, K., Rudzionis, A., Maskeliunas, R., Raskinis, G.: Voice
controlled interface for the medical-pharmaceutical information system. In: Skersys, T.,
Butleris, R., Butkiene, R. (eds.) ICIST 2012. CCIS, vol. 319, pp. 288–296. Springer,
Heidelberg (2012). ISBN: 9783642333071
5. Rudzionis, V., Raskinis, G., Maskeliunas, R., Rudzionis, A., Ratkevicius, K.: Comparative
analysis of adapted foreign language and native lithuanian speech recognizers for voice user
interface. Elektronika ir elektrotechnika 19(7), 90–93 (2013). ISSN 1392-1215
6. Rudžionis, V., Ratkevičius, K., Rudžionis, A., Raškinis, G., Maskeliunas, R.: Recognition of
voice commands using hybrid approach. In: Skersys, T., Butleris, R., Butkiene, R. (eds.)
ICIST 2013. CCIS, vol. 403, pp. 249–260. Springer, Heidelberg (2013)
7. Rudzionis, V., Raskinis, G., Maskeliunas, R., Rudzionis, A., Ratkevicius, K., Bartisiute, G.:
Web services based hybrid recognizer of lithuanian voice commands. Elektronika ir
elektrotechnika 20(9), 50–53 (2014). ISSN 1392-1215
8. Rudžionis, V., Raškinis, G., Ratkevičius, K., Rudžionis, A., Bartišiūtė, G.: Medical-pharmaceutical information system with recognition of Lithuanian voice commands. In: Human Language Technologies – The Baltic Perspective: Proceedings of the 6th International Conference, ISBN: 978161499441, pp. 40–45. IOS Press, Amsterdam (2014)
9. Bartišiūtė, G., Ratkevičius, K., Paškauskaitė, G.: Hybrid recognition technology for isolated
voice commands. In: Information Systems Architecture and Technology: Proceedings of 36th
International Conference on Information Systems Architecture and Technology – ISAT 2015
– Part IV, ISBN 978-3-319-28565-8, pp. 207–216 (2016)
10. Bartišiūtė, G., Paškauskaitė, G., Ratkevičius, K.: Investigation of disease codes recognition
accuracy. In: Proceedings of the 9th International Conference on Electrical and Control
Technologies, ECT 2014, pp. 60–63 (2014)
11. Rasymas, T., Rudžionis, V.: Evaluation of methods to combine different speech recognizers.
In: Computer Science and Information Systems (FedCSIS), pp. 1043–1047 (2015)
12. Rasymas, T., Rudžionis, V.: Lithuanian digits recognition by using hybrid approach by
combining lithuanian google recognizer and some foreign language recognizers. In:
Information and Software Technologies, ISBN 978-3-319-24769-4, pp 449–459 (2015)
13. Lileikytė, R., Telksnys, A.L.: Metrics based quality estimation of speech recognition features.
Informatica Vilnius, Matematikos ir informatikos institutas 24(3), 435–446 (2013). ISSN:
0868-4952
14. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Garret, M.,
Strope, B.: Google Search by Voice: A case study