Comparing Speech Recognition Systems (Microsoft API, Google
API And CMU Sphinx)
Veton Këpuska1, Gamal Bohouta2
1,2(Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL, USA)
ABSTRACT
The idea of this paper is to design a tool that will be used to test and compare commercial speech recognition
systems, such as Microsoft Speech API and Google Speech API, with open-source speech recognition systems
such as Sphinx-4. The best way to compare automatic speech recognition systems in different environments is
by using some audio recordings that were selected from different sources and calculating the word error rate
(WER). Although the WERs of the three aforementioned systems were acceptable, it was observed that the
Google API is superior.
Keywords: Speech Recognition, Testing Speech Recognition Systems, Microsoft Speech API, Google Speech
API, CMU Sphinx-4 Speech Recognition.
I. INTRODUCTION
Automatic Speech Recognition (ASR) is
commonly employed in everyday applications. “One
of the goals of speech recognition is to
allow natural communication between humans and
computers via speech, where natural implies
similarity to the ways humans interact with each
other” [8]. ASR underlies many systems that improve the interaction between users and computers. According to Dale
Isaacs, “Today automatic speech recognition (ASR)
systems and text-to-speech (TTS) systems are quite
well established. These systems, using the latest
technologies, are operating at accuracies in excess of
90%” [6]. With the increasing number of ASR systems, such as Microsoft, Google, Sphinx, WUW, HTK and Dragon, it becomes difficult to know which of them best fits a given need. This paper presents the results of testing the Microsoft API, the Google API, and Sphinx-4 using a tool designed and implemented in the Java language, together with audio recordings selected from a large number of sources. In comparing these systems, several components were evaluated, such as the acoustic model, the language model, and the dictionary.
There are a number of commercial and
open-source systems such as AT&T Watson,
Microsoft Speech API, Google Speech API,
Amazon Alexa API, Nuance Recognizer, WUW,
HTK and Dragon [2]. Three systems were selected
for our evaluation in different environments:
Microsoft API, Google API, and Sphinx-4 automatic
speech recognition systems. Two of the biggest
companies building voice-powered applications are
Google and Microsoft [4]. The Microsoft API and Google API are commercial speech recognition systems whose code is inaccessible, whereas Sphinx-4 is an ASR system whose code is freely available for download [3].
II. THE CMU SPHINX
The Sphinx system has been developed at
Carnegie Mellon University (CMU). Currently, “CMU Sphinx has a large vocabulary, speaker-independent speech recognition codebase, and its code is available for download and use” [13]. Sphinx has several versions and packages for
different tasks and applications such as Sphinx-2,
Sphinx-3 and Sphinx-4. Also, there are additional
packages such as Pocketsphinx, Sphinxbase,
Sphinxtrain. In this paper, Sphinx-4 will be evaluated. Sphinx-4 is written in the Java programming language. Moreover, “its structure has been designed with a high degree of flexibility and modularity” [13]. According to Juraj Kačur, “The latest Sphinx-4 is written in JAVA, and main theoretical improvements are: support for finite grammar called Java Speech API grammar, and it doesn’t impose the restriction of using the same structure for all models” [13] [5]. There are three main components in the Sphinx-4 structure: the Frontend, the Decoder, and the Linguist. According to Willie Walker and others who have worked on Sphinx-4, "we created a number of
differing implementations for each module in the
framework. For example, the Frontend
implementations support MFCC, PLP, and LPC
feature extraction; the Linguist implementations
support a variety of language models, including
CFGs, FSTs, and N-Grams; and the Decoder
supports a variety of Search Manager
implementations" [1]. Therefore, Sphinx-4 provides a recent HMM-based speech recognizer with a strong acoustic model trained on a large vocabulary [2].
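To illustrate how these components are tied to the acoustic model, dictionary, and language model discussed above, the following is a minimal sketch using the sphinx4-core (5prealpha) API and its bundled en-us models. This API may differ from the exact Sphinx-4 release used in our experiments, and the audio file name is illustrative.

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.FileInputStream;
import java.io.InputStream;

public class Sphinx4Example {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // The Linguist draws on the acoustic model, the dictionary and the language model.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // The Frontend consumes 16 kHz, 16-bit, mono PCM audio.
        InputStream audio = new FileInputStream("test.wav"); // illustrative file name
        recognizer.startRecognition(audio);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            // The Decoder's best hypothesis for each utterance
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
```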
III. THE GOOGLE API
Google has improved its speech recognition by applying new technology in many applications, such as Goog411, Voice Search on mobile, Voice Actions, Voice Input (spoken input to keypad), the Android Developer APIs, Voice Search on desktop, YouTube transcription, Translate, Navigate, and TTS.
After Google adopted deep-learning neural networks, it achieved an 8 percent word error rate in 2015, a reduction of more than 23 percent from 2013. According to Sundar Pichai, senior vice president of Android, Chrome, and Apps at Google, “We have the best investments in machine learning over the past many years.” Indeed, Google has acquired several deep learning companies over the years, including DeepMind, DNNresearch, and Jetpac [11].
IV. THE MICROSOFT API
Microsoft has been developing the Speech API since 1993, when the company hired Xuedong (XD) Huang, Fil Alleva, and Mei-Yuh Hwang, “three of the four people responsible for the Carnegie Mellon University Sphinx-II speech recognition system, which achieved fame in the speech world in 1992 due to its unprecedented accuracy,” to form the first Speech API (SAPI) 1.0 team in 1994 [12].
Microsoft has continued to develop the Speech API and has released a series of increasingly powerful speech platforms. The Microsoft team released the Speech API (SAPI) 5.3 with Windows Vista, which was very powerful and useful. On the developer front, "Windows Vista includes a new WinFX® namespace, System.Speech. This allows developers to easily speech-enable Windows Forms applications and apps based on the Windows Presentation Framework" [12].
Microsoft has placed increasing emphasis on speech recognition and has improved the Speech API (SAPI) by using context-dependent deep neural network hidden Markov models (CD-DNN-HMMs). The researchers working with Microsoft on the Speech API and the CD-DNN-HMM models determined that large-vocabulary speech recognition with CD-DNN-HMMs achieves substantially better results than with context-dependent Gaussian mixture model hidden Markov models [12]. Just recently, Microsoft announced a “Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition” [15].
V. EXPERIMENTS
The best way to test the quality of various
ASR systems is to calculate the word error rate
(WER). Using the WER, we can also compare the different models in the ASR systems, such as the acoustic model, the language model, and the dictionary size. In this paper, we have developed a tool that we used to test these models in the Microsoft API, the Google API, and Sphinx-4. We also calculated the WER by using this tool to recognize a list of sentences, which we collected in the form of audio files and text transcriptions. The following sections describe the steps we followed to design the tool and to test the Microsoft API, Google API, and Sphinx-4.
VI. TESTING DATA
The audio files were selected from various
sources to evaluate the Microsoft API, Google API,
and Sphinx-4. According to CMUSphinx, Sphinx-4's decoder supports only specific audio formats (16000 Hz or 8000 Hz sampling rates) [13]. Also, Google does not recognize the WAV format generally used with Sphinx-4; part of the process of recognizing WAV files with Google involves converting them to the FLAC format. Microsoft can recognize any WAV file format. We solved this problem by having our tool process all audio files in the same format (16000 Hz / 8000 Hz).
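As an illustration of this normalization step, the following is a minimal sketch that converts an arbitrary WAV file to 16 kHz, 16-bit, mono PCM using the standard javax.sound.sampled API. The file names are illustrative, the sketch assumes a Java Sound provider capable of sample-rate conversion is available, and the separate conversion to FLAC for Google is not shown.

```java
import javax.sound.sampled.*;
import java.io.File;

public class AudioNormalizer {
    public static void main(String[] args) throws Exception {
        File in = new File("input.wav");       // hypothetical source recording
        File out = new File("input_16k.wav");  // normalized output file

        AudioInputStream source = AudioSystem.getAudioInputStream(in);

        // Target format expected by the recognizers: 16 kHz, 16-bit, mono, little-endian PCM.
        AudioFormat target = new AudioFormat(
                AudioFormat.Encoding.PCM_SIGNED,
                16000f,  // sample rate
                16,      // bits per sample
                1,       // channels (mono)
                2,       // frame size in bytes (16-bit mono)
                16000f,  // frame rate
                false);  // little-endian

        // Conversion succeeds only if an installed provider supports this format change.
        AudioInputStream converted = AudioSystem.getAudioInputStream(target, source);
        AudioSystem.write(converted, AudioFileFormat.Type.WAVE, out);

        converted.close();
        source.close();
    }
}
```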
Some of the audio files have been selected
from the TIMIT corpus. The TIMIT corpus of read
speech is designed to provide speech data for
acoustic-phonetic studies and for the development
and evaluation of automatic speech recognition
systems. TIMIT contains broadband recordings of
630 speakers of eight major dialects of American
English, each reading ten phonetically rich
sentences [14]. The TIMIT corpus includes time-
aligned orthographic, phonetic and word
transcriptions as well as a 16-bit, 16kHz speech
waveform file for each utterance. Corpus design was
a joint effort among the Massachusetts Institute of
Technology (MIT), SRI International (SRI) and
Texas Instruments, Inc. (TI) [9].
Also, we have selected other audio files
from ITU (International Telecommunication Union)
which is the United Nations Specialized Agency in
the field of telecommunications [10]. Examples of some of the audio files are presented in Table 1 below:
Table 1. The Audio Files
VII. SYSTEM DESCRIPTION
This system has been designed using the Java language, the same language used by Sphinx-4, as well as C#, which was used to test the Microsoft API and Google API. We also used several libraries, such as a Text-to-Speech API, a Graph API, and a Math API, for different tasks. The tool was connected to the classes of Sphinx-4, the Microsoft API, and the Google API so that they work together to recognize the audio files; the recognition results were then compared with the original reference texts.
Figure 1. The System Interface.
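The following is a minimal sketch of such an evaluation loop, not our exact implementation: each engine is wrapped behind a small common interface, every audio file is decoded, and the hypothesis is scored against the reference transcript. The interface, class names, and file layout are assumptions, and the WER helper it calls is the one sketched in Section VIII below.

```java
import java.io.File;
import java.util.List;

// Hypothetical harness illustrating how the tool drives several recognizers
// over the same test set and scores each hypothesis against its reference.
public class Evaluator {

    /** Common wrapper each engine (Sphinx-4, Microsoft API, Google API) would implement. */
    interface Recognizer {
        String name();
        String recognize(File audio) throws Exception; // returns the hypothesis text
    }

    static void evaluate(Recognizer engine, List<File> audioFiles,
                         List<String> references) throws Exception {
        double totalWer = 0.0;
        for (int i = 0; i < audioFiles.size(); i++) {
            String hypothesis = engine.recognize(audioFiles.get(i));
            // WerCalculator.wer(...) is the word-level scorer sketched in Section VIII.
            totalWer += WerCalculator.wer(references.get(i), hypothesis);
        }
        System.out.printf("%s: average WER = %.2f%n",
                engine.name(), totalWer / audioFiles.size());
    }

    public static void main(String[] args) throws Exception {
        // Toy engine that always returns the same hypothesis, for illustration only.
        Recognizer dummy = new Recognizer() {
            public String name() { return "dummy"; }
            public String recognize(File audio) { return "the small boy that the worm on the hook"; }
        };
        evaluate(dummy,
                 List.of(new File("timit_sample.wav")),               // hypothetical audio file
                 List.of("the small boy put the worm on the hook"));  // its reference text
    }
}
```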
VIII. EXPERIMENTAL RESULTS
The audio recordings with the original sentences were used to test Sphinx-4, the Microsoft API, and the Google API. Using our tool, we tested all the files and calculated the word error rate (WER) and word accuracy according to the following equations.
WER = (I + D + S) / N
where I is the number of inserted words, D the number of deleted words, S the number of substituted words, and N the number of words in the reference.
The original text (Reference):
the small boy PUT the worm on the hook
The recognition text (Hypothesis):
the small boy THAT the worm on the hook
WER = (0 + 0 + 1) / 9 = 0.11
Accuracy = (N - D - S) / N
The original text (Reference):
the coffee STANDARD is too high for the couch
The recognition text (Hypothesis):
the coffee STAND is too high for the couch
WA = (9 - 0 - 1) / 9 = 0.88
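A minimal sketch of this word-level scoring, computed as the edit distance between reference and hypothesis, is shown below. It reproduces the first worked example above and is an illustration rather than the exact code of our tool.

```java
public class WerCalculator {

    /** Word error rate: minimum (substitutions + deletions + insertions) / reference length. */
    public static double wer(String reference, String hypothesis) {
        String[] ref = reference.toLowerCase().trim().split("\\s+");
        String[] hyp = hypothesis.toLowerCase().trim().split("\\s+");

        // d[i][j] = edit distance between the first i reference words
        // and the first j hypothesis words.
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;

        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = d[i - 1][j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int del = d[i - 1][j] + 1;   // word missing from the hypothesis
                int ins = d[i][j - 1] + 1;   // extra word in the hypothesis
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        // Worked example from the paper: one substitution in nine words, WER ≈ 0.11.
        System.out.println(wer("the small boy put the worm on the hook",
                               "the small boy that the worm on the hook"));
    }
}
```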
Figure 2. The Structure of The System.
Figure 3. The Result of Sphinx-4
By using our tool, we gathered the following results: Sphinx-4 (37% WER), Google Speech API (9% WER), and Microsoft Speech API (18% WER). In the tables below, S denotes the number of sentences, N the number of words, I inserted words, D deleted words, S substituted words, CW correct words, and EW error words.
Table 3. The Final Results of Sphinx-4
Table 4. The Final Results of Microsoft API
Table 5. The Final Results of Google API
Table 6. Comparison Between Three Systems
Figure 4. Comparison Between Three Systems
IX. CONCLUSION
In this paper, the tool that we built to test Sphinx-4, the Microsoft API, and the Google API, using audio recordings selected from many sources together with their original sentences, showed that Sphinx-4 achieved 37% WER, the Microsoft API achieved 18% WER, and the Google API achieved 9% WER. Therefore, it can be stated that the acoustic and language models of Google are superior.
REFERENCES
[1]. W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Sun Microsystems, SMLI TR-2004-139, 2004, 1-14.
[2]. C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft, Comparing Open-Source Speech Recognition Toolkits, The Baden-Wuerttemberg Ministry of Science and Arts research project, 2011.
[3]. K. Samudravijaya and M. Barol, Comparison of Public Domain Software Tools for Speech Recognition, ISCA Archive, 2013.
[4]. P. Lange and D. Suendermann, Tuning Sphinx to Outperform Google’s Speech Recognition API, The Baden-Wuerttemberg Ministry of Science and Arts research project.
[5]. J. Kačur, HTK vs. Sphinx for Speech Recognition, Department of Telecommunications, FEI STU.
[6]. D. Isaacs and D. Mashao, A Comparison of the Network Speech Recognition and Distributed Speech Recognition Systems and their Effect on Speech Enabling Mobile Devices, doctoral diss., Speech Technology and Research Group, University of Cape Town, 2010.
[7]. R. Srikanth, L. Bo and J. Salsman, Automatic Pronunciation Evaluation and Mispronunciation Detection Using CMUSphinx, COLING, 2012, 61-68.
[8]. V. Kepuska, Wake-Up-Word Speech Recognition, InTech, 2011.
[9]. STAR (2016) SRI International's Speech Technology and Research (STAR) Laboratory. SRI, http://www.speech.sri.com/.
[10]. ITU (2016) Committed to connecting the world. ITU, http://www.itu.int//.
[11]. J. Novet (2016) Google says its speech recognition technology now has only an 8% word error rate. VentureBeat, http://venturebeat.com/2015/05/28/.
[12]. Microsoft Corporation (2016) Exploring New Speech Recognition and Synthesis APIs in Windows Vista. Microsoft, http://web.archive.org/.
[13]. CMUSphinx (2016) CMUSphinx Tutorial for Developers. Carnegie Mellon University, http://www.speech.cs.cmu.edu/sphinx/.
[14]. TIMIT (2016) TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, https://catalog.ldc.upenn.edu/LDC93S1.
[15]. Microsoft Corporation (2016) Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition, https://blogs.microsoft.com.