Eurospeech 2001 - Scandinavia
Julius — an Open Source Real-Time Large Vocabulary Recognition Engine
Akinobu Lee, Tatsuya Kawahara, Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology, Japan
School of Informatics, Kyoto University, Japan
ri@is.aist-nara.ac.jp
Abstract
Julius is a high-performance, two-pass LVCSR decoder for researchers and developers. Based on word 3-gram and context-dependent HMM, it performs almost real-time decoding on most current PCs in a 20k-word dictation task. Major search techniques are fully incorporated, such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning and Gaussian selection. Besides search efficiency, it is carefully modularized to be independent of model structures, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states or phones. Standard formats are adopted to interoperate with other free modeling toolkits. The main platform is Linux and other Unix workstations, and it partially works on Windows. Julius is distributed under an open license together with its source code, and has been used by many researchers and developers in Japan.
1. Introduction
In recent studies of large vocabulary continuous speech recognition, the necessity of a common platform has been recognized. By collecting databases and sharing tools, individual components such as language models and acoustic models can be evaluated fairly through direct comparison of recognition results. This benefits both research on the various component technologies and the development of recognition systems.
The essential factors for a decoder to serve as a common platform are performance, modularity and availability. Besides accuracy and speed, it has to deal with many model types, with any number of mixtures, states or phones. The interface between the engine and the models also needs to be simple and open, so that anyone can build a recognition system using their own models. And in order to modify or improve the engine for a specific purpose, the engine needs to be open to the public. Although many recognition systems are available for research purposes and as commercial products, not all engines satisfy these requirements.
We developed a two-pass, real-time, open-source large vocabulary continuous speech recognition engine named "Julius" for researchers and developers. It incorporates many search techniques to provide high search performance, and can deal with various types of models for their cross evaluation. The standard formats used by other popular modeling tools[1][2] are adopted. On a 20k-word dictation task with a word 3-gram and triphone HMM, it realizes almost real-time processing on most current PCs.

Figure 1: Screen shot of the Japanese dictation system using Julius.
Julius has been developed and maintained as part of a free software toolkit for Japanese LVCSR[4] since 1997, on a volunteer basis. The work is now continuing under the Continuous Speech Recognition Consortium, Japan[3]. The software is available for free together with its source code. A screen shot of the Japanese dictation system is shown in Figure 1.

In this paper, we describe the search algorithm, module interfaces, implementation and performance of Julius. It can be downloaded from the URL shown in the last section.
Figure 2: Overview of Julius. (First pass: frame-synchronous beam search, 1-best, using the word 2-gram, the tree lexicon and a cross-word approximation; second pass: stack decoding search, N-best, using the word 3-gram with no approximation. The passes are connected by the word trellis index, and both use the context-dependent HMM acoustic model.)
2. Search Algorithm
Julius performs a two-pass (forward-backward) search, using a word 2-gram and a word 3-gram on the respective passes[5]. The overall structure is illustrated in Figure 2. Many major search techniques are incorporated; the details are described below.
2.1. First Pass
On the first pass, a tree-structured lexicon assigned with language model probabilities is searched with the frame-synchronous beam search algorithm. Pre-computed 1-gram factoring values are assigned to the intermediate nodes, and 2-gram probabilities are applied at the word-end nodes. Cross-word context dependency is handled with an approximation that applies the best model for the best history. As the 1-gram factoring values are independent of the preceding words, they can be computed statically in a single tree lexicon, which needs much less working memory. Although the values are theoretically not optimal with respect to the true 2-gram probabilities, the resulting errors can be recovered on the second pass.
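As a rough illustration, the factoring values can be pre-computed by one bottom-up traversal of the tree lexicon. The following C fragment is only a minimal sketch of this idea under assumed data structures (Node, unigram_logprob and so on are hypothetical names, not Julius internals):

    #include <math.h>

    #define MAX_CHILDREN 64

    typedef struct node {
        struct node *child[MAX_CHILDREN];
        int n_children;
        int word_id;       /* >= 0 if a word ends at this node, else -1 */
        float factor_lm;   /* pre-computed 1-gram factoring value (log prob) */
    } Node;

    extern float unigram_logprob[];   /* log 1-gram probability per word id */

    /* Assign each node the best (maximum) 1-gram log probability among all
     * words reachable below it.  Since the value does not depend on the
     * preceding word, one pass over a single shared tree suffices. */
    static float assign_factoring(Node *n)
    {
        float best = (n->word_id >= 0) ? unigram_logprob[n->word_id]
                                       : -INFINITY;
        for (int i = 0; i < n->n_children; i++) {
            float c = assign_factoring(n->child[i]);
            if (c > best)
                best = c;
        }
        n->factor_lm = best;
        return best;
    }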
One-best approximation, rather than word-pair approximation, is assumed. The degradation caused by this rough approximation in the first pass is recovered by the tree-trellis search in the second pass. A word trellis index, a per-frame set of survived word-end nodes together with their scores and corresponding start frames, is adopted to efficiently look up predicted word candidates and their scores on the later pass. It allows the later pass to recover word-boundary and scoring errors of the preliminary pass, and thus enables fast approximate search with little loss of accuracy.
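Concretely, the word trellis index can be pictured as the following per-frame record (a hypothetical layout for illustration; field names are assumptions):

    /* One surviving word-end hypothesis at a given frame. */
    typedef struct {
        int   word_id;      /* word that ends at this frame */
        int   begin_frame;  /* frame where this word hypothesis started */
        float score;        /* accumulated log likelihood at the word end */
    } TrellisAtom;

    /* One entry per input frame: all word ends that survived the beam. */
    typedef struct {
        TrellisAtom *atoms;
        int          n;
    } TrellisFrame;

On the second pass, looking up the words predicted to end at a given frame is then a matter of scanning the atoms recorded for that frame.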
2.2. Second Pass
On the second pass, the 3-gram language model and precise cross-word context dependency are applied for re-scoring. The search is performed in the reverse direction, and a precise sentence-dependent N-best score is calculated by connecting with the backward trellis kept from the first pass. The speech input is scanned again for re-scoring the cross-word context dependency and for connecting to the backward trellis. There is also an option that applies the cross-word context-dependent models to word-end phones without delay, for more accurate decoding. We enhanced the stack-decoding search by setting a maximum number of hypotheses for every sentence length (an envelope), since simple best-first search sometimes fails to produce any recognition result. The search is not A*-admissible, because the second pass may give better scores than the first pass; this means the first output candidate may not be the best one. We therefore compute several candidates by continuing the search and sort them for the final output.
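A minimal sketch of such an envelope, under the assumption that it is simply a per-length expansion budget (the constants and names here are made up for illustration):

    #define MAX_LEN        64   /* maximum hypothesis length in words */
    #define ENVELOPE_WIDTH 30   /* hypotheses allowed per length */

    static int expanded[MAX_LEN];  /* hypotheses popped so far, per length */

    /* Returns nonzero if a hypothesis of the given word length may still
     * be expanded; once the budget for a length is used up, best-first
     * search can no longer wander at that depth. */
    static int envelope_allows(int len)
    {
        if (len >= MAX_LEN)
            return 0;
        return expanded[len]++ < ENVELOPE_WIDTH;
    }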
2.3. Gaussian Pruning and Gaussian Mixture Selection on Acoustic Computation
For efficient decoding with tied-mixture models that have large mixture pdfs, Gaussian pruning is implemented[6]. It prunes the Gaussian distance (log likelihood) computation partway through the feature vector dimensions if the Gaussian is not promising. Using the already-computed k-best values as the threshold guarantees that the optimal Gaussians are found, but does not eliminate much computation (safe pruning). A more aggressive method sets up a beam width in the intermediate dimensions (beam pruning). We perform safe pruning in the standard decoding and beam pruning in the efficient decoding.
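The idea can be sketched as follows for a diagonal-covariance Gaussian (an illustrative simplification, not the Julius source; DIM and the threshold handling are assumptions). The partial distance only grows as dimensions are accumulated, so the loop can stop as soon as it exceeds the current k-best threshold:

    #include <float.h>

    #define DIM 26   /* number of feature dimensions (assumed) */

    /* Returns the weighted squared distance (proportional to the negative
     * log likelihood), or FLT_MAX if the computation is pruned halfway. */
    static float gaussian_dist(const float *x, const float *mean,
                               const float *inv_var, float thresh)
    {
        float d = 0.0f;
        for (int i = 0; i < DIM; i++) {
            float diff = x[i] - mean[i];
            d += diff * diff * inv_var[i];
            if (d > thresh)      /* already worse than the k-th best */
                return FLT_MAX;  /* abandon this Gaussian */
        }
        return d;
    }

Beam pruning differs only in using a tighter, non-safe threshold at the intermediate dimensions.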
To further reduce the acoustic computation cost of triphone models, a kind of Gaussian selection scheme called Gaussian mixture selection is introduced[7]. An additional context-independent HMM with smaller mixtures is used for pre-selection of triphone states. The state likelihoods of the context-independent models are computed first at each frame, and then only those triphone states whose corresponding monophone states rank within the k-best are computed. The unselected states are given the likelihood of the monophone itself. Compared to normal Gaussian selection, which selects Gaussian clusters definitively by a VQ codebook, the unselected states are reliably backed off by being assigned an actual likelihood instead of some constant value, which realizes stable recognition under even tighter pruning conditions.
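The selection step reduces to a simple gate per state, roughly as below (function and array names are assumptions made for this sketch):

    extern float mono_loglik(int mono_state, const float *x);
    extern float tri_loglik(int tri_state, const float *x);
    extern int   mono_of[];    /* triphone state -> its monophone state */
    extern int   in_kbest[];   /* 1 if the monophone state ranked within
                                * the k-best at the current frame, else 0 */

    /* Evaluate a triphone state only if its monophone ranked k-best;
     * otherwise back off to the monophone likelihood itself. */
    static float state_loglik(int tri_state, const float *x)
    {
        int m = mono_of[tri_state];
        return in_kbest[m] ? tri_loglik(tri_state, x)
                           : mono_loglik(m, x);
    }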
2.4. Transparent Word Handling
Toward recognition of spontaneous speech, transparent word handling is also implemented for fillers. The N-gram probabilities of transparent words are applied in the same way as for other words, but the words are skipped from the word context, to prevent them from affecting the occurrence of neighboring words.
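In other words, a filler consumes its own N-gram probability but leaves the history untouched, roughly like this (a minimal sketch with assumed names):

    extern int is_transparent[];   /* 1 for fillers, indexed by word id */

    /* Advance the (w2, w1) trigram history with word w; transparent
     * words are scored normally elsewhere but do not enter the history. */
    static void advance_context(int *w2, int *w1, int w)
    {
        if (is_transparent[w])
            return;     /* history unchanged: neighbors see through it */
        *w2 = *w1;
        *w1 = w;
    }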
2.5. Alternative Algorithms
Besides the algorithms described above, conventional algorithms are also implemented for comparison. The default algorithms, namely 1-gram factoring, one-best approximation and the word trellis index, can be replaced by conventional 2-gram factoring, word-pair approximation and a word graph interface, respectively. They are selectable at compile time, and any combination is allowed. Users can choose the algorithms suitable for their evaluation and development.
3. Module Interface
In order to act as a module of various speech recognition systems, a recognition engine needs a simple and clear interface to the other modules. We adopt standard and common formats as the module interface, to keep generality and modularity across various models. The model interfaces, speech input and output of Julius are described in the following.
3.1. Acoustic Model
Monophone and triphone HMMs with any number of mixtures, states or phones are supported. Julius can also handle tied-mixture models and the phonetic tied-mixture model[6]. The model type is identified automatically. The HMM definition file should be in HTK format. When a tied-mixture model is given (an hmmdefs using the directive <TMix>), Gaussian pruning is activated for each mixture pdf. Not all features of HTK hmmdefs are supported: multiple input streams, discrete HMMs, covariance matrices other than diagonal, and duration parameters are not supported.
3.2. Lexicon
The format of the recognition dictionary is similar to the HTK dictionary format. It is a list of words with their output strings and pronunciations. Each pronunciation is expressed as a sequence of sub-word unit names of the acoustic model; in fact, any sub-word unit, such as syllables, can be used. Multiple pronunciations of a word should be written as separate word entries, each carrying one possible pronunciation. With a triphone model, each pronunciation unit is converted to a context-aware form (e.g. "a-k+i") at startup. To map the possible (logical) triphones to the ones defined (physical) in the hmmdefs, the acoustic model should be accompanied by an HMM list that specifies the correspondences.
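As an illustration only (these are hypothetical entries, not taken from the distribution, and the phone symbols are assumptions), a dictionary in this style might look as follows, with the output string in brackets and the pronunciation given in acoustic-model units; the two lines for the second word show how multiple pronunciations become separate entries:

    会議 [カイギ] k a i g i
    日本 [ニホン] n i h o N
    日本 [ニッポン] n i q p o N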
The maximum vocabulary size is set to 65,535 words by default for memory efficiency, but a larger size can be allowed if configured so.
3.3. Language Model
Two language models are needed: a word 2-gram for the first pass and a word 3-gram in the reverse direction for the second pass. Files in the ARPA-standard format can be read directly. To speed up the startup of the recognition system, an original binary format is also supported; a conversion tool "mkbingram" is included in the distribution package.
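For reference, the ARPA-standard format is a plain-text listing of log10 probabilities (and back-off weights) per N-gram; a minimal 2-gram file looks like the following, with all values made up:

    \data\
    ngram 1=3
    ngram 2=2

    \1-grams:
    -1.0000 </s>
    -0.6990 <s> -0.3010
    -0.6990 word -0.3010

    \2-grams:
    -0.3010 <s> word
    -0.3010 word </s>

    \end\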
3.4. Input / Output
Speech input from a waveform file (16-bit PCM) or a pattern vector file (HTK format) is supported. Live microphone input is also supported on Linux, FreeBSD, Sun and SGI workstations. Input can also be sent via a TCP/IP network using the DAT-Link/netaudio protocol. These input speech streams can be segmented on pauses by watching the zero-crossing rate and power, and each segment is recognized sequentially in turn.
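A crude picture of such pause segmentation, as a per-frame speech/non-speech decision (thresholds, frame size and the combination rule are assumptions of this sketch, not the actual Julius logic):

    #define FRAME_LEN   160      /* 10 ms at 16 kHz (assumed) */
    #define POWER_THRES 1000.0f  /* mean squared amplitude (assumed) */
    #define ZC_THRES    20       /* zero crossings per frame (assumed) */

    /* A frame counts as speech when both its power and its zero-crossing
     * count exceed fixed thresholds; a long enough run of non-speech
     * frames ends the current segment. */
    static int is_speech_frame(const short *s)
    {
        float power = 0.0f;
        int zc = 0;
        for (int i = 0; i < FRAME_LEN; i++) {
            power += (float)s[i] * (float)s[i];
            if (i > 0 && ((s[i-1] < 0) != (s[i] < 0)))
                zc++;
        }
        return power / FRAME_LEN > POWER_THRES && zc > ZC_THRES;
    }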
Decoding on the first pass runs in parallel with the speech input. Processing starts as soon as an input segment starts, and when a long pause is detected, the first pass finishes and the second pass follows. As the second pass finishes in a very short time, the delay before the recognition result is output is sufficiently small.
The current speech analysis function of Julius can extract only one type of coefficients for our acoustic models (26-dimension MFCC_E_D_Z only). CMN (cepstral mean normalization) is activated automatically if the acoustic model requires it. In the case of file input, the cepstral mean is first computed over the file or segment. For live input, the average values of the last 5 seconds are used.
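The two CMN modes amount to the following (a minimal sketch under assumed names; the live-input variant uses an exponential running mean as a stand-in for the paper's last-5-seconds average):

    #define CEP_DIM 26

    /* File input: subtract the mean computed over the whole utterance. */
    static void cmn_file(float **cep, int n_frames)
    {
        for (int d = 0; d < CEP_DIM; d++) {
            float mean = 0.0f;
            for (int t = 0; t < n_frames; t++)
                mean += cep[t][d];
            mean /= (float)n_frames;
            for (int t = 0; t < n_frames; t++)
                cep[t][d] -= mean;
        }
    }

    /* Live input: maintain a running mean and subtract it per frame. */
    static void cmn_live(float *frame, float *run_mean, float decay)
    {
        for (int d = 0; d < CEP_DIM; d++) {
            run_mean[d] = decay * run_mean[d] + (1.0f - decay) * frame[d];
            frame[d] -= run_mean[d];
        }
    }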
The output is a recognition result as a word sequence, and N-best results can be produced. Phoneme sequences, log likelihood scores and several search statistics can also be generated. Partial results can be output successively while the first pass is running, though the final result remains undetermined until the end of the second pass.
3.5. Search Parameters
Various search parameters can be set at both compile time and run time. The language model weight and insertion penalty, as well as the beam width, can be adjusted for each pass. Two default decoding configurations are also set up: standard decoding strictly handles the context dependency of the acoustic models for accurate recognition, while efficient decoding uses a smaller beam width by default and terminates the search when the first candidate is obtained.
4. Implementation and Distribution
The main platform is Linux, but Julius also works on other Unix workstations: FreeBSD, Solaris 2, IRIX 6.3 and Digital UNIX. Live microphone input is supported on most of these systems. It can also run on Microsoft Windows; we are working on making the Windows version fully functional, but its features are currently limited.

Table 1: Performance of the 20k-word Japanese dictation system

                            efficient      accurate
    acoustic model          PTM 129x64     triphone 2000x16
    Julius configuration    fast           standard
    CPU time                1.3 xRT        5.0 xRT
    Word acc. (male)        89.0           94.4
    Word acc. (female)      91.8           95.6
    Word acc. (GD avg.)     90.4           95.0
    Word acc. (GID)         89.7           93.6
    (Julius 3.2, CPU: Pentium III 850 MHz, without Gaussian mixture selection)
The whole source code is distributed freely. By preparing an acoustic model and a language model, one can construct a speech recognition system for their own task. Users can modify the source, or parts of it, for their applications without explicit permission from the authors, both for research purposes and even for commercial purposes.

Julius has been developed since 1996. The current revision is 3.2, and development is still continuing on a volunteer basis. The source code is written in C; it totals about 22,000 lines and 604 KB. Although it has been developed and tested in a Japanese environment, it should work for other languages with little modification.
5. Evaluation on Japanese Dictation Task
The performance of the total Japanese dictation system, using Julius with the typical models provided by the IPA Japanese dictation toolkit[4], is summarized in Table 1 for the 20k-word system. Two typical configurations are listed: efficiency-oriented and accuracy-oriented. Note that Gaussian mixture selection and transparent word handling are not included in this experiment.

The accurate version, with the triphone model and standard decoding, achieves a word accuracy of 95%. The efficient version, using the PTM model, keeps the accuracy above 90% and runs almost in real time on a standard PC. The required memory is about 100 MB for the efficient version and about 200 MB for the accurate version. The difference comes mainly from the acoustic probability cache, as the state probabilities of all frames are cached in the first pass to be accessed on the second pass.

The total system performance with Gaussian mixture selection integrated is shown in Table 2. A real-time factor of 1.06 is achieved even with the standard setting, and the word accuracy reaches 92.1%.
Table 2: Total performance

                            efficient + GMS
    acoustic model          PTM 129x64
    Julius configuration    standard
    CPU time                1.0 xRT
    Word acc. (male)        90.7
    Word acc. (female)      93.5
    Word acc. (GD avg.)     92.1
    (Julius 3.2, CPU: Pentium III 850 MHz)
6. Conclusions
A two-pass, open-source large vocabulary continuous speech recognition engine named Julius has been introduced. It achieves a word accuracy of 95% in the accurate setting, and over 90% with real-time processing, on 20k-word dictation. It is well modularized, with a simple and popular interface, so as to be used as an assessment platform. It provides a complete recognition facility, with current state-of-the-art search techniques, open to all researchers and developers engaged in speech-related work.

Julius has been used by many researchers and developers in Japan as a standard system. Future work will be dedicated to further refinement of performance (especially memory usage), stability, and more documentation. The main WWW page is at the URL below:
http://winnie.kuis.kyoto-u.ac.jp/pub/julius/
7. Acknowledgment
Part of the work is sponsored by CREST (Core Research
for Evolutional Science and Technology), Japan.
8. References
[1] P. R. Clarkson and R. Rosenfeld: Statistical Language Modeling Using the CMU-Cambridge Toolkit. In Proc. ESCA Eurospeech '97, Vol. 5, pages 2707–2710, 1997.
[2] S. Young, J. Jansen, J. Odell, D. Ollason and P. Woodland: The HTK Book. Entropic Cambridge Research Laboratory, 1995.
[3] http://www.lang.astem.or.jp/CSRC/
[4] T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, S. Sagayama, K. Itou, A. Ito, M. Yamamoto, A. Yamada, T. Utsuro and K. Shikano: Free Software Toolkit for Japanese Large Vocabulary Continuous Speech Recognition. In Proc. ICSLP, Vol. 4, pages 476–479, 2000.
[5] A. Lee, T. Kawahara and S. Doshita: An Efficient Two-Pass Search Algorithm using Word Trellis Index. In Proc. ICSLP, pages 1831–1834, 1998.
[6] A. Lee, T. Kawahara, K. Takeda and K. Shikano: A New Phonetic Tied-Mixture Model for Efficient Decoding. In Proc. IEEE-ICASSP, pages 1269–1272, 2000.
[7] A. Lee, T. Kawahara and K. Shikano: Gaussian Mixture Selection using Context-independent HMM. In Proc. IEEE-ICASSP, 2001.