Int. Journal of Engineering Research and Application (IJERA), ISSN 2248-9622, Vol. 7, Issue 3, Part 2, March 2017, pp. 20-24. DOI: 10.9790/9622-0703022024. www.ijera.com
Comparing Speech Recognition Systems (Microsoft API, Google
API And CMU Sphinx)
Veton Këpuska1, Gamal Bohouta2
1,2(Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL, USA)
ABSTRACT
The idea of this paper is to design a tool that can be used to test and compare commercial speech recognition systems, such as the Microsoft Speech API and the Google Speech API, with open-source speech recognition systems such as Sphinx-4. The best way to compare automatic speech recognition systems across different environments is to run audio recordings selected from a variety of sources through each system and calculate the word error rate (WER). Although the WER of each of the three systems was acceptable, the Google API proved superior.
Keywords: Speech Recognition, Testing Speech Recognition Systems, Microsoft Speech API, Google Speech API, CMU Sphinx-4 Speech Recognition.
I. INTRODUCTION
Automatic Speech Recognition (ASR) is commonly employed in everyday applications. "One of the goals of speech recognition is to allow natural communication between humans and computers via speech, where natural implies similarity to the ways humans interact with each other" [8]. ASR has produced many systems that improve the quality of interaction between users and computers. According to Dale Isaacs, "Today automatic speech recognition (ASR) systems and text-to-speech (TTS) systems are quite well established. These systems, using the latest technologies, are operating at accuracies in excess of 90%" [6]. With the growing number of ASR systems, such as Microsoft, Google, Sphinx, WUW, HTK and Dragon, it has become difficult to know which of them best fits a given need. This paper therefore presents the results of testing the Microsoft API, the Google API, and Sphinx-4 using a tool designed and implemented in the Java language, together with audio recordings selected from a large number of sources. In comparing these systems, several components were evaluated, such as the acoustic model, the language model, and the dictionary.
There are a number of commercial and open-source systems, such as AT&T Watson, the Microsoft Speech API, the Google Speech API, the Amazon Alexa API, Nuance Recognizer, WUW, HTK and Dragon [2]. Three systems were selected for our evaluation in different environments: the Microsoft API, the Google API, and the Sphinx-4 automatic speech recognition system. Two of the biggest companies building voice-powered applications are Google and Microsoft [4]. The Microsoft API and the Google API are commercial speech recognition systems whose code is inaccessible, while Sphinx-4 is an ASR system whose code is freely available for download [3].
II. THE CMU SPHINX
The Sphinx system has been developed at Carnegie Mellon University (CMU). Currently, "CMU Sphinx has a large vocabulary, speaker independent speech recognition codebase, and its code is available for download and use" [13]. Sphinx has several versions and packages for different tasks and applications, such as Sphinx-2, Sphinx-3 and Sphinx-4, as well as additional packages such as Pocketsphinx, Sphinxbase and Sphinxtrain. In this paper, Sphinx-4 is evaluated. Sphinx-4 is written in the Java programming language, and "its structure has been designed with a high degree of flexibility and modularity" [13]. According to Juraj Kačur, "The latest Sphinx-4 is written in JAVA, and main theoretical improvements are: support for finite grammar called Java Speech API grammar, it doesn't impose the restriction using the same structure for all models" [13][5]. There are three main components in the Sphinx-4 structure: the Frontend, the Decoder and the Linguist. According to Willie Walker and others who have worked on Sphinx-4, "we created a number of differing implementations for each module in the framework. For example, the Frontend implementations support MFCC, PLP, and LPC feature extraction; the Linguist implementations support a variety of language models, including CFGs, FSTs, and N-Grams; and the Decoder supports a variety of Search Manager implementations" [1]. Sphinx-4 thus provides a recent HMM-based recognizer with a strong acoustic model trained on a large vocabulary [2].
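As a concrete illustration, the following minimal sketch transcribes an audio file with the Sphinx-4 high-level API, along the lines of the CMUSphinx developer tutorial [13]. It is a sketch under assumptions rather than the paper's actual tool: the model paths refer to the default US-English models bundled with the sphinx4-data package, and utterance.wav is a hypothetical 16 kHz, 16-bit mono recording.

import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class Sphinx4Demo {
    public static void main(String[] args) throws Exception {
        // Point the recognizer at the default US-English acoustic model,
        // dictionary, and language model shipped with sphinx4-data.
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // Hypothetical input file: 16 kHz, 16-bit, mono WAV.
        try (InputStream stream = new FileInputStream("utterance.wav")) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println("Hypothesis: " + result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}

Because the acoustic model, dictionary, and language model are plain configuration paths, swapping any of them is enough to evaluate a different model combination, which is what makes this structure convenient for the kind of comparison performed here.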
III. THE GOOGLE API
Google has improved its speech recognition by applying new technology across many applications, such as Goog411, Voice Search on mobile, Voice Actions, Voice Input (spoken input to keypad), the Android Developer APIs, Voice Search on desktop, YouTube transcription, Translate, Navigate, and TTS.
After adopting deep learning neural networks, Google achieved an 8 percent word error rate in 2015, a reduction of more than 23 percent from 2013 [11]. According to Sundar Pichai, senior vice president of Android, Chrome, and Apps at Google, "We have the best investments in machine learning over the past many years." Indeed, Google has acquired several deep learning companies over the years, including DeepMind, DNNresearch, and Jetpac [11].
IV. THE MICROSOFT API
Microsoft has been developing its Speech API since 1993, when the company hired Xuedong (XD) Huang, Fil Alleva, and Mei-Yuh Hwang, "three of the four people responsible for the Carnegie Mellon University Sphinx-II speech recognition system, which achieved fame in the speech world in 1992 due to its unprecedented accuracy" [12]. This led to the formation of the first Speech API (SAPI 1.0) team in 1994 [12].
Microsoft has continued to develop the Speech API and has released a series of increasingly powerful speech platforms. The Microsoft team released the Speech API (SAPI) 5.3 with Windows Vista, which was very powerful and useful. On the developer front, "Windows Vista includes a new WinFX® namespace, System.Speech. This allows developers to easily speech-enable Windows Forms applications and apps based on the Windows Presentation Framework" [12].
Microsoft has placed increasing emphasis on speech recognition and improved the Speech API by using a context-dependent deep neural network hidden Markov model (CD-DNN-HMM). The researchers who worked with Microsoft on the Speech API and the CD-DNN-HMM models determined that it achieves substantially better results on large-vocabulary speech recognition than a context-dependent Gaussian mixture model hidden Markov model [12]. Just recently, Microsoft announced a "Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition" [15].
V. EXPERIMENTS
The best way to test the quality of various ASR systems is to calculate the word error rate (WER). Using the WER, we can also evaluate the different models in the ASR systems, such as the acoustic model, the language model, and the dictionary size. For this paper, we developed a tool that we used to test these models in the Microsoft API, the Google API, and Sphinx-4. We calculated the WER by using this tool to recognize a list of sentences, which we collected in the form of audio files and their text transcriptions. The following sections describe how we designed the tool and tested the Microsoft API, the Google API, and Sphinx-4.
VI. TESTING DATA
The audio files were selected from various sources to evaluate the Microsoft API, the Google API, and Sphinx-4. According to CMUSphinx, the Sphinx-4 decoder supports only specific audio formats, sampled at 16,000 Hz or 8,000 Hz [13]. Also, Google does not recognize the WAV format generally used with Sphinx-4; part of the process of recognizing WAV files with Google involves converting them to the FLAC format. Microsoft can recognize any WAV file format. We solved this problem by having our tool convert all audio files to the same format (16,000 Hz / 8,000 Hz) before recognition.
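A minimal sketch of this normalization step, assuming the standard javax.sound.sampled API (file names are hypothetical, and FLAC encoding is not part of the core Java library and would require an external encoder, so only the WAV resampling is shown):

import java.io.File;

import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class AudioNormalizer {
    public static void main(String[] args) throws Exception {
        // Read the source WAV file, whatever its original sampling rate.
        AudioInputStream source = AudioSystem.getAudioInputStream(new File("input.wav"));

        // Target format: 16,000 Hz, 16-bit, mono, signed PCM, little-endian.
        AudioFormat target = new AudioFormat(16000f, 16, 1, true, false);

        // Ask the Java sound API for a converting stream. Note: depending on
        // the JVM's installed converters, a direct sample-rate conversion may
        // be unsupported and require an intermediate PCM step or a
        // third-party resampler.
        AudioInputStream converted = AudioSystem.getAudioInputStream(target, source);

        // Write the normalized audio back out as a WAV file.
        AudioSystem.write(converted, AudioFileFormat.Type.WAVE, new File("output-16k.wav"));

        converted.close();
        source.close();
    }
}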
Some of the audio files were selected from the TIMIT corpus. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences [14]. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI) [9].
We also selected other audio files from the ITU (International Telecommunication Union), the United Nations specialized agency in the field of telecommunications [10]. Examples of the audio files are presented in Table 1 below:
Table 1. The Audio Files
VII. SYSTEM DESCRIPTION
The system was designed in the Java language, the same language used by Sphinx-4, together with C#, which was used to test the Microsoft API and the Google API. We also used several libraries, such as a text-to-speech API, a graph API and a math API, for different tasks. The tool is connected to the Sphinx-4, Microsoft API and Google API classes so that they work together to recognize the audio files; the recognition results are then compared with the original reference transcripts.
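Structurally, the tool's flow can be pictured with the following hypothetical sketch (the interface and class names are ours for illustration, not the tool's actual code): each engine is wrapped behind a common interface, every audio file is passed to each engine, and the hypothesis is printed next to the reference transcript for scoring.

import java.io.File;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper interface; each engine (Sphinx-4, the Microsoft API,
// the Google API) would implement transcribe() with its own client code.
interface Recognizer {
    String name();
    String transcribe(File audio) throws Exception;
}

public class EvaluationDriver {
    // Run every engine over every test file and pair the hypothesis with
    // the reference transcript for later WER scoring.
    public static void run(List<Recognizer> engines,
                           Map<File, String> testSet) throws Exception {
        for (Recognizer engine : engines) {
            for (Map.Entry<File, String> sample : testSet.entrySet()) {
                String hypothesis = engine.transcribe(sample.getKey());
                System.out.printf("[%s] %s%n  REF: %s%n  HYP: %s%n",
                        engine.name(), sample.getKey().getName(),
                        sample.getValue(), hypothesis);
            }
        }
    }
}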
Figure 1. The System Interface.
VIII. EXPERIMENTAL RESULTS
The audio recordings, together with their original sentences, were used to test Sphinx-4, the Microsoft API, and the Google API. Using our tool, we tested all files and calculated the word error rate (WER) and accuracy according to the following equations:
WER = (I + D + S) / N
where I is the number of inserted words, D the number of deleted words, S the number of substituted words, and N the number of words in the reference. For example:
The original text (Reference):
the small boy PUT the worm on the hook
The recognition text (Hypothesis):
the small boy THAT the worm on the hook
WER = (0 + 0 + 1) / 9 = 0.11
Accuracy = (N - D - S) / N
For the example above, the word accuracy is WA = (9 - 0 - 1) / 9 = 0.88. A second example:
The original text (Reference):
the coffee STANDARD is too high for the couch
The recognition text (Hypothesis):
the coffee STAND is too high for the couch
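The WER computation reduces to a word-level edit distance between the reference and the hypothesis. The following self-contained sketch (our own illustration, not the tool's actual code) counts insertions, deletions, and substitutions with the standard Levenshtein dynamic program and reproduces the 0.11 figure for the first example above:

public class WordErrorRate {
    // Minimum number of insertions, deletions, and substitutions needed to
    // turn the reference word sequence into the hypothesis (word-level
    // Levenshtein distance).
    public static int editDistance(String reference, String hypothesis) {
        String[] ref = reference.trim().toLowerCase().split("\\s+");
        String[] hyp = hypothesis.trim().toLowerCase().split("\\s+");
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;   // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;   // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,        // substitution or match
                           Math.min(d[i - 1][j] + 1,             // deletion
                                    d[i][j - 1] + 1));           // insertion
            }
        }
        return d[ref.length][hyp.length];
    }

    public static double wer(String reference, String hypothesis) {
        int n = reference.trim().split("\\s+").length;
        return (double) editDistance(reference, hypothesis) / n;
    }

    public static void main(String[] args) {
        String ref = "the small boy put the worm on the hook";
        String hyp = "the small boy that the worm on the hook";
        System.out.printf("WER = %.2f%n", wer(ref, hyp));   // prints WER = 0.11
    }
}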
Figure 2. The Structure of The System.
Figure 3. The Result of Sphinx-4
Using our tool, we gathered the data; the results are as follows: Sphinx-4 (37% WER), Google Speech API (9% WER) and Microsoft Speech API (18% WER). In the tables below, S is the number of sentences, N the number of words, I the inserted words, D the deleted words, S the substituted words, CW the correct words, and EW the error words.
Table 3. The Final Results of Sphinx-4
Table 4. The Final Results of Microsoft API
Table 5. The Final Results of Google API
Table 6. Comparison Between Three Systems
Figure 4. Comparison Between Three Systems
IX. CONCLUSION
The tool we built to test Sphinx-4, the Microsoft API, and the Google API, using audio recordings selected from many sources along with their original sentences, showed that Sphinx-4 achieved 37% WER, the Microsoft API achieved 18% WER and the Google API achieved 9% WER. It can therefore be stated that Google's acoustic and language modeling is superior.
REFERENCES
[1]. W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Sun Microsystems, SMLI TR-2004-139, 2004, 1-14.
[2]. C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft, Comparing Open-Source Speech Recognition Toolkits, The Baden-Wuerttemberg Ministry of Science and Arts research project, 2011.
[3]. K. Samudravijaya and M. Barol, Comparison of Public Domain Software Tools for Speech Recognition, ISCA Archive, 2013.
[4]. P. Lange and D. Suendermann, Tuning Sphinx to Outperform Google's Speech Recognition API, The Baden-Wuerttemberg Ministry of Science and Arts research project.
[5]. J. Kačur, HTK vs. Sphinx for Speech Recognition, Department of Telecommunications, FEI STU.
[6]. D. Isaacs and D. Mashao, A Comparison of the Network Speech Recognition and Distributed Speech Recognition Systems and their Effect on Speech Enabling Mobile Devices, doctoral diss., Speech Technology and Research Group, University of Cape Town, 2010.
[7]. R. Srikanth, L. Bo and J. Salsman, Automatic Pronunciation Evaluation and Mispronunciation Detection Using CMUSphinx, COLING, 2012, 61-68.
[8]. V. Kepuska, Wake-Up-Word Speech Recognition, IN TECH, 2011.
[9]. STAR (2016) SRI International's Speech Technology and Research (STAR) Laboratory. SRI, http://www.speech.sri.com/.
[10]. ITU (2016) Committed to connecting the world. ITU, http://www.itu.int/.
[11]. J. Novet (2015) Google says its speech recognition technology now has only an 8% word error rate. VentureBeat, http://venturebeat.com/2015/05/28/.
[12]. Microsoft Corporation (2016) Exploring New Speech Recognition and Synthesis APIs in Windows Vista. Microsoft, http://web.archive.org/.
[13]. CMUSphinx (2016) CMUSphinx Tutorial for Developers. Carnegie Mellon University, http://www.speech.cs.cmu.edu/sphinx/.
[14]. TIMIT (2016) TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, https://catalog.ldc.upenn.edu/LDC93S1.
[15]. Microsoft Corporation (2016) Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition. https://blogs.microsoft.com.