Kłosowski EURASIP Journal on Audio, Speech, and Music
Processing (2017) 2017:5
DOI 10.1186/s13636-017-0102-8
RESEARCH Open Access
Statistical analysis of orthographic and
phonemic language corpus for word-based
and phoneme-based Polish language
modelling
Piotr Kłosowski
Abstract
This article presents original results of a statistical analysis of the Polish language, based on an orthographic and a phonemic language corpus. The phonemic language corpus for Polish was developed by automatic grapheme-to-phoneme conversion of the source orthographic language corpus, obtained from the National Corpus of Polish (NCP). The corpus contains the most frequently used Polish words, written in phonemic notation. The statistical analysis covers the frequency of occurrence of orthographic and phonemic language components, as well as of their sequences. The resulting statistical language data make it possible to develop statistical word-based and phoneme-based language models for Polish. Applying these language models can help improve the efficiency of automatic speech recognition for Polish.
Keywords: Automatic grapheme-to-phoneme conversion, Automatic speech recognition, Language corpus,
Language modelling, Language statistical analysis
Introduction
The main goal of automatic speech recognition (ASR) is the translation of spoken words into text [1]. Modern speech recognition systems require both acoustic and language modelling [2], and both are important parts of the modern statistical approach to speech recognition [3, 4]. Statistical language modelling makes it possible to develop effective large-vocabulary speech recognition systems [5]. Language modelling is used not only in speech recognition but also in other areas of speech and language processing, e.g., language recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, and information retrieval.
The main motivation of research in the speech recognition area is to improve the automatic speech recognition process, especially for the Polish language [6, 7]. Additionally, research studies have been conducted on the properties of Polish phonemes [8, 9], speech recognition based on them [10], speaker recognition [11, 12], speaker verification [13–15], and new applications of speech recognition, e.g., automatic speech translation [16].
Correspondence: pklosowski@polsl.pl
Department of Electronics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
Particularly good automatic speech recognition performance is achieved with statistical methods [17]. Therefore, the main objective of the research presented in this paper was to perform a statistical analysis of the Polish language based on the orthographic and phonemic language corpora, in order to develop statistical word-based and phoneme-based language models and apply them to improve speech recognition for Polish. Statistical language models help to predict a sequence of recognized spoken words and phonemes, and their use can contribute to improving the effectiveness of automatic speech recognition based on statistical methods. The development of word-based and phoneme-based language models for speech recognition, built on statistical language data, requires access to large orthographic and phonemic language corpora [18, 19].
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Orthographic language corpus
One of the largest orthographic corpora of the Polish language is the National Corpus of Polish (NCP) [20]. The NCP corpus is available to the scientific community, offers great flexibility, and is extremely important in terms of scientific value. It provides crucial reference material reflecting the state of the contemporary Polish language and meets all the requirements of modern science [21]. It can be used particularly by linguists, but also by computer scientists interested in natural language processing.
The NCP corpus contains over 1500 million words.
The corpus is searchable by means of advanced tools,
developed by the Institute of Computer Science at the
Polish Academy of Sciences, which analyse Polish inflec-
tion and Polish sentence structure. The list of sources
for the NCP corpus, presented in Table 1, contains clas-
sic literature, daily newspapers, specialist periodicals and
journals, transcripts of conversations, and a variety of
short-lived and internet texts [22].
The results of the statistical analyses presented in this paper can be considered representative of the Polish language as a whole, which is justified to a certain extent by the corpus size. However, it is worth remembering that the NCP corpus is still based primarily on written texts. Spoken language transcripts constitute a smaller percentage of the corpus contents, which may still be significant for certain specialized continuous or conversational speech recognition tasks.
Table 2 presents the details of the orthographic language
corpus content, obtained from the NCP corpus resources.
Phonemic language corpus
Grapheme-to-phoneme conversion
The phonemic Polish language corpus contains words written in phonemic notation, obtained by automatic grapheme-to-phoneme conversion of an orthographic text. Automatic processing of a natural language very often requires automatic grapheme-to-phoneme conversion, which determines phonemic transcriptions directly from orthographic representations [23].
Table 1 Structure of the NCP corpus [20]
Type of text source | Percentage of the NCP corpus size
Daily newspapers | 50.0%
Classic literature | 16.0%
Non-fiction literature | 5.5%
Specialized periodicals and journals | 5.5%
Scientific and educational texts | 2.0%
Other written texts | 3.0%
Other books | 1.0%
Transcripts of conversations | 10.0%
Internet texts | 7.0%
Table 2 Details of the orthographic language corpus content
No. | Component type | No. of unique components | No. of components in the corpus
1 | single words | 1,943,462 | 230,301,313
2 | 2-word sequences | 75,395,184 | 246,110,034
3 | 3-word sequences | 170,180,746 | 246,066,692
4 | 4-word sequences | 217,586,930 | 246,023,356
5 | 5-word sequences | 232,439,967 | 245,980,021
Phonemes are usually written with specially designed alphabets. The most commonly used alphabet for this purpose is the International Phonetic Alphabet (IPA) [24]. It was created on the basis of the phonetics and phonology of West European languages and is not well adapted to Polish. For Polish, as for other Slavic languages, a special transcription system called the Slavistic Phonetic Alphabet (SPA) is most frequently used [25]. Another frequently used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [26], a machine-readable phonetic alphabet based on the IPA that uses 7-bit printable ASCII characters. Table 3 presents the set of Polish phonemes and examples of their occurrence in Polish, written in the SPA, IPA, and SAMPA phonetic alphabets.
Knowledge-based grapheme-to-phoneme (G2P) approaches, unlike data-driven G2P approaches, use rules created by humans or derived from linguistic studies to convert the sequence of graphemes in a word into a sequence of phonemes [27]. Rule-based grapheme-to-phoneme approaches are typically formulated in the framework of finite state automata and require the formulation of grapheme-to-phoneme conversion rules [28]. The largest contribution to solving the problem of automatic grapheme-to-phoneme conversion for Polish was made by the publications of Maria Steffen-Batóg [29, 30].
The automatic grapheme-to-phoneme conversion process can be described as a function F, defined by the following formula:
$$F(\alpha) = \beta \qquad (1)$$
where:
$$\alpha = \alpha_1 \ldots \alpha_k \ldots \alpha_a, \quad \alpha_k \in X \quad (1 \le k \le a) \qquad (2)$$
$$\beta = \beta_1 \ldots \beta_k \ldots \beta_b, \quad \beta_k \in Y \quad (1 \le k \le b) \qquad (3)$$
and where a is the length of the orthographic character sequence and b is the length of the phonemic character sequence.
Table 3 A set of Polish phonemes and examples of their occurrence
No. | [SPA] | [IPA] | [SAMPA] | Example of occurrence in Polish
1 | [e] | [ɛ] | [e] | serce
2 | [a] | [ɑ] | [a] | baba
3 | [o] | [ɔ] | [o] | oko
4 | [t] | [t] | [t] | trawa
5 | [n] | [n] | [n] | noc
6 | [y] | [ɨ] | [I] | syty
7 | [i̯] | [j] | [j] | jajo
8 | [i] | [i] | [i] | wici
9 | [r] | [r] | [r] | rok
10 | [s] | [s] | [s] | sok
11 | [v] | [v] | [v] | wada
12 | [p] | [p] | [p] | praca
13 | [u] | [u] | [u] | buk
14 | [m] | [m] | [m] | mama
15 | [k] | [k] | [k] | kot
16 | [ń] | [ɲ] | [n'] | koń
17 | [d] | [d] | [d] | dudek
18 | [l] | [l] | [l] | lato
19 | [ɫ] | [w] | [w] | łysy
20 | [š] | [ʃ] | [S] | szyszka
21 | [f] | [f] | [f] | fala
22 | [z] | [z] | [z] | koza
23 | [c] | [t͡s] | [ts] | cacko
24 | [b] | [b] | [b] | baba
25 | [g] | [g] | [g] | godło
26 | [ś] | [ɕ] | [s'] | siano
27 | [ć] | [t͡ɕ] | [ts'] | ciasto
28 | [ɣ́] | [ʝ] | [x] | higiena
29 | [č] | [t͡ʃ] | [tS] | czarny
30 | [ž] | [ʒ] | [Z] | każdy
31 | [ŋ] | [ŋ] | [N] | ręka
32 | [ḱ] | [c] | [k'] | kino
33 | [ʒ́] | [d͡ʑ] | [dz'] | dziedzic
34 | [ʒ] | [d͡z] | [dz] | nadzy
35 | [ź] | [ʑ] | [z'] | ziarno
36 | [ǵ] | [ɟ] | [g'] | magiczny
37 | [ǯ] | [d͡ʒ] | [dZ] | drożdże
Here X is the set of characters of the Polish orthographic alphabet, extended with special characters, and Y is the set of characters of the Polish phonemic alphabet, described by the Slavistic Phonetic Alphabet:
X = {a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż, q, v, x, ., ?, !, ,, :, ;, -, (, ), #, /}   (4)
Y = {i, y, e, a, o, u, i̯, u̯, r, l, m, n, ń, ŋ, f, v, s, z, ś, ź, χ, p, b, t, d, k, g, ḱ, ǵ, c, ʒ, č, ǯ, ć, ʒ́}   (5)
Grapheme-to-phoneme conversion of correctly written orthographic texts in Polish is the transformation of words written in the orthographic alphabet X into a form written in the phonemic alphabet Y. The automatic grapheme-to-phoneme conversion function F can be described by a set of formal grapheme-to-phoneme conversion rules defining how each word α, constructed from the orthographic alphabet X, is transformed into a new word β constructed from the phonemic alphabet defined by the set Y. The rules are usually numerous and vary in complexity. The size and complexity of the rule set depend on the number of letters in the orthographic alphabet and on the fact that each letter can be pronounced differently in various contexts.
A set of grapheme-to-phoneme conversion rules for Polish was developed by Maria Steffen-Batóg and presented in monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish [29, 30]. The knowledge contained in these monographs was essential in implementing the automatic grapheme-to-phoneme conversion algorithm for Polish. According to Maria Steffen-Batóg, all grapheme-to-phoneme conversion rules relating to one orthographic letter can be stored in one table, called the grapheme-to-phoneme conversion rules table for that letter.
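The idea of a function F that maps a word over X to a word over Y by table lookup can be illustrated with a minimal sketch. The rule table below is a toy assumption made up for this illustration: it is not the Steffen-Batóg rule set and not the TransFon tables, it covers only part of the alphabet, and it ignores context entirely. The names `TOY_RULES` and `toy_g2p` are hypothetical.

```python
# Toy rule table (an assumption for illustration only): each grapheme maps to
# one SAMPA phoneme label, with no context conditions at all.
TOY_RULES = {
    "sz": "S", "cz": "tS", "ch": "x", "rz": "Z",
    "a": "a", "b": "b", "c": "ts", "d": "d", "e": "e", "f": "f",
    "g": "g", "i": "i", "k": "k", "l": "l", "ł": "w",
    "m": "m", "n": "n", "o": "o", "p": "p", "r": "r", "s": "s",
    "t": "t", "u": "u", "w": "v", "y": "I", "z": "z",
}
MAX_LEN = max(len(grapheme) for grapheme in TOY_RULES)


def toy_g2p(word):
    """Map an orthographic word (alpha) to a list of phoneme labels (beta)."""
    word = word.lower()
    phonemes = []
    k = 0
    while k < len(word):
        # Prefer the longest grapheme that has a rule (e.g. "sz" before "s").
        for size in range(MAX_LEN, 0, -1):
            chunk = word[k:k + size]
            if chunk in TOY_RULES:
                phonemes.append(TOY_RULES[chunk])
                k += len(chunk)
                break
        else:
            # No rule applies to the character at position k.
            raise KeyError(f"no toy rule for {word[k]!r} in {word!r}")
    return phonemes


print(toy_g2p("szyszka"))  # phoneme labels for an example word from Table 3
print(toy_g2p("przez"))    # context-free rules miss devoicing in this word
```

Because the toy rules are context-free, they cannot reproduce transcriptions that depend on neighbouring letters. For "przez" the sketch yields p, Z, e, z, whereas the corpus sample in Table 8 lists [p][S][e][s]. This is the kind of context dependence that per-letter rule tables with context conditions are meant to capture.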
Following the grapheme-to-phoneme conversion rules for Polish described in the literature [29–32], the conversion has been implemented in the Python programming language as an automatic grapheme-to-phoneme conversion application named TransFon [33]. The implementation includes 975 grapheme-to-phoneme conversion rules for the 35 orthographic letters of Polish, additional conversion rules for special characters, and the automatic grapheme-to-phoneme conversion algorithm [33]. A block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic word is presented in Fig. 1. Many words have multiple correct pronunciation variants, but the implementation includes only the most common, basic variant. Implementation of additional pronunciation variants is planned for the future. The problem of the phonemic transcription of foreign words and acronyms has been solved by using a dictionary in which their phonemic transcriptions are defined.
Fig. 1 Block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic word
The TransFon application was developed from scratch, without adapting any existing similar tools. It is not the only grapheme-to-phoneme conversion implementation for the Polish language [34–38], but of the others only one is available for free use [38]. The developed implementation can be applied to any task (e.g., phonemic language corpus development for Polish).
Table 4 presents examples of phonemic transcription in Polish, written in the SPA, IPA, and SAMPA phonetic alphabets [25, 26].
The TransFon application makes it possible to create the phonemic language corpus solely on the basis of the orthographic source corpus. Automatic grapheme-to-phoneme conversion of the orthographic corpus with the TransFon application yielded the phonemic language corpus for Polish, which was then used for the statistical analysis of the Polish language.
Evaluation of grapheme-to-phoneme conversion
implementation
The evaluation of the automatic grapheme-to-phoneme conversion implementation is crucial. During the implementation of automatic grapheme-to-phoneme conversion for Polish, it was necessary to check and demonstrate that it works properly.
The test procedure for the automatic grapheme-to-phoneme conversion implementation consisted of:
• Performing a test automatic grapheme-to-phoneme conversion of the orthographic text corpus file containing the 1,943,462 most frequently used unique words in Polish, obtained from the National Corpus of Polish resources [20].
• In case of doubt, validating and verifying the conversion results for individual words against an online Polish language dictionary that specifies the correct pronunciation of Polish words [39].
• Registering cases of incorrect automatic grapheme-to-phoneme conversion, conversion errors, and other encountered problems.
The automatic phonemic transcription application was implemented in such a way that the conversion algorithm stopped whenever a grapheme-to-phoneme conversion problem occurred (e.g., when there was no rule allowing a correct phonemic transcription). This makes it easier to improve and develop the application. In addition, any doubts about correct pronunciation were resolved with the help of the wiktionary.org service [39]. This approach has some serious limitations: the wiktionary.org dictionary contains only 61,141 Polish words, and only in their basic form. The verification was further complicated by other problems, such as different variants of the correct pronunciation of words and the pronunciation of foreign words in the corpus.
The causes of problems and errors in automatic grapheme-to-phoneme conversion were as follows:
• errors in the implementation of the grapheme-to-phoneme conversion algorithm and conversion rules,
• missing grapheme-to-phoneme conversion rules in the tables (i.e., rules not included in the tables) for some orthographic letter contexts,
• grapheme-to-phoneme conversion of foreign words, acronyms, and words that are not present in the Polish language dictionary.
The above problems were solved in the following way:
• The errors in the implementation of the grapheme-to-phoneme conversion algorithm and in the conversion rules tables were corrected by modifying the application source code in the Python programming language.
• The problem of missing grapheme-to-phoneme conversion rules was solved by adding new conversion rules to the existing tables. New conversion rules were added, for some contexts, for the following orthographic letters: “i”, “n”, “d”, “z”, “z”, “c”, “f”, “s”.
• The problems of foreign words and acronyms were solved by using a dictionary in which the phonemic transcriptions of foreign words and acronyms are defined. As a result, rule-based automatic grapheme-to-phoneme conversion was complemented by a dictionary-based automatic grapheme-to-phoneme conversion method.
Table 4 Phonemic transcription examples in Polish
No. | Orthographic text | Phonemic transcription [SPA] | Phonemic transcription [IPA] | Phonemic transcription [SAMPA]
1 | ząb | [zomp] | [zomp] | [zomp]
2 | ślub | [ślup] | [ślup] | [s'lup]
3 | wkręty | [fkrenty] | [fkrɛnty] | [fkrentI]
4 | bieżnia | [bi̯ežńa] | [bjɛʒɲa] | [bjeZn'a]
5 | wszystkie | [fšystḱe] | [fʃɨstcɛ] | [fSIstk'e]
6 | natomiast | [natomi̯ast] | [natomjast] | [natomjast]
7 | przypadku | [pšypatku] | [pʃɨpatku] | [pSIpatku]
8 | najbardziej | [nai̯barʒ́ei̯] | [najbard͡ʑej] | [najbardz'ej]
9 | oczywiście | [očyviśće] | [ot͡ʃɨviɕt͡ɕɛ] | [otSIvis'ts'e]
10 | powiedział | [povi̯eʒ́au̯] | [povjed͡ʑaw] | [povjedz'aw]
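The combination of dictionary-based and rule-based conversion, together with the behaviour of stopping when no rule applies, can be sketched as follows. This is only an illustration under stated assumptions: the exception dictionary entry, the tiny letter table, and the names `EXCEPTIONS`, `LETTER_RULES`, `MissingRuleError`, and `transcribe` are hypothetical and are not taken from TransFon.

```python
class MissingRuleError(Exception):
    """Raised when neither the dictionary nor the rules can transcribe a word."""


# Hypothetical exception dictionary for foreign words and acronyms;
# the single entry is an illustration only.
EXCEPTIONS = {"weekend": ["w", "i", "k", "e", "n", "t"]}

# Hypothetical context-free letter rules covering a tiny part of the alphabet.
LETTER_RULES = {"k": "k", "o": "o", "t": "t", "r": "r", "a": "a", "m": "m"}


def transcribe(word):
    """Look the word up in the dictionary first, then fall back to the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    for position, letter in enumerate(word):
        if letter not in LETTER_RULES:
            # Stop immediately so that the missing rule can be registered.
            raise MissingRuleError(
                f"no rule for {letter!r} at position {position} in {word!r}"
            )
        phonemes.append(LETTER_RULES[letter])
    return phonemes


print(transcribe("kot"))      # handled by the letter rules
print(transcribe("weekend"))  # handled by the exception dictionary
```

Checking the dictionary before the rules keeps irregular words out of the rule tables, and raising an error instead of guessing makes every uncovered case visible during testing.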
A number of improvements made it possible to increase the effectiveness of the grapheme-to-phoneme conversion implementation. Tables 5 and 6 present the word error rate (WER) values of the grapheme-to-phoneme conversion implementation before and after the improvements.
After the improvements, the WER value for the 1,943,462 checked unique words was 0.387%, and the WER value for the corpus containing 230,301,313 words was 0.030%. Compared with the values before the improvements (1.731% and 1.610%, respectively), this shows that the implemented modifications improved the effectiveness of G2P conversion.
Table 5 WER values of the developed G2P conversion implementation, before improvements
No. | Parameter | Value
1 | No. of checked unique words | 1,943,462
2 | No. of G2P conversion errors for unique words | 33,638
3 | WER value for unique words in % | 1.731
4 | No. of words in the corpus | 230,301,313
5 | No. of G2P conversion errors for words in corpus | 3,707,890
6 | WER value for corpus in % | 1.610
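The WER values in Tables 5 and 6 are consistent with dividing the number of conversion errors by the number of checked words. The short sketch below recomputes them from the error counts given in the tables; the function name `wer_percent` is an assumption made for this illustration.

```python
def wer_percent(errors, total):
    """Word error rate in percent: erroneous words divided by all checked words."""
    return 100.0 * errors / total


UNIQUE_WORDS = 1_943_462
CORPUS_WORDS = 230_301_313

# Error counts before the improvements (Table 5) and after them (Table 6).
before_unique = wer_percent(33_638, UNIQUE_WORDS)
before_corpus = wer_percent(3_707_890, CORPUS_WORDS)
after_unique = wer_percent(7_525, UNIQUE_WORDS)
after_corpus = wer_percent(69_802, CORPUS_WORDS)

print(f"unique words: {before_unique:.3f}% -> {after_unique:.3f}%")
print(f"corpus words: {before_corpus:.3f}% -> {after_corpus:.3f}%")
```

The corpus-level WER weights each unique word by its number of occurrences, which is why it can differ strongly from the WER over unique words.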
The developed phonemic language corpus for Polish
The phonemic language corpus for Polish was developed by automatic grapheme-to-phoneme conversion of the source orthographic language corpus file obtained from the NCP corpus resources.
Table 7 presents the details of the phonemic language corpus content.
The phonemic language corpus contains a list of 1,943,462 Polish words written orthographically, their phonemic transcriptions written in the SAMPA phonemic alphabet, and the number of occurrences of each word in the NCP balanced corpus. The size of the NCP balanced corpus is measured as the sum of all word occurrence counts, which is equal to 230,301,313 words.
Table 6 WER values of the developed G2P conversion implementation, after improvements
No. | Parameter | Value
1 | No. of checked unique words | 1,943,462
2 | No. of G2P conversion errors for unique words | 7,525
3 | WER value for unique words in % | 0.387
4 | No. of words in the corpus | 230,301,313
5 | No. of G2P conversion errors for words in corpus | 69,802
6 | WER value for words in the corpus in % | 0.030
Table 7 Details of the phonemic language corpus content
No. | Component type | No. of unique components | No. of components in the corpus
1 | single phonemes | 37 | 1,263,248,497
2 | 2-phoneme sequences | 1,096 | 1,032,922,921
3 | 3-phoneme sequences | 17,340 | 823,393,519
4 | 4-phoneme sequences | 128,766 | 644,597,673
5 | 5-phoneme sequences | 402,529 | 483,987,550
A sample section of the developed phonemic language corpus for Polish is presented in Table 8. It should also be noted that standard SAMPA for Polish includes several sequences of phonemic transcription labels that may be ambiguous unless they are separated by spaces or other characters. To avoid this problem, each phoneme is enclosed in square brackets.
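The ambiguity can be seen with the SAMPA labels [ts], [t], and [s]: without separators, the string "ts" could be read either as one phoneme or as two. A minimal sketch of reading the bracketed notation is given below; the pattern and the function name `parse_transcription` are assumptions made for this illustration, not part of the corpus tooling.

```python
import re

# One phoneme label is whatever stands between a pair of square brackets.
PHONEME_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def parse_transcription(text):
    """Split a bracketed transcription such as "[ts][o]" into phoneme labels."""
    return PHONEME_PATTERN.findall(text)


one_phoneme = parse_transcription("[ts]")
two_phonemes = parse_transcription("[t][s]")
print(one_phoneme, two_phonemes)

# Without brackets both readings collapse into the same character string.
print("".join(one_phoneme) == "".join(two_phonemes))
```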
Analysis of the obtained results and discussion
Statistical analysis of the orthographic and phonemic
language corpora
Using the orthographic and phonemic language corpora, it was possible to perform a statistical analysis of the Polish language, which includes the calculation of the following distributions:
• the frequency of single orthographic word occurrence,
• the frequency of n-word sequence occurrence for n = 2, ..., 5,
• the frequency of phoneme occurrence,
• the frequency of n-phoneme sequence occurrence for n = 2, ..., 5.
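One plausible way to obtain such distributions from a corpus stored as a word list with occurrence counts is sketched below. It is a sketch under assumptions, not the procedure used in the paper: n-phoneme sequences are counted only inside words, every sequence is weighted by the occurrence count of its word, and the function names are hypothetical. The three sample entries are taken from Table 8.

```python
from collections import Counter


def phoneme_ngram_counts(corpus, n):
    """Count n-phoneme sequences inside words, weighted by word occurrences."""
    counts = Counter()
    for phonemes, occurrences in corpus:
        for start in range(len(phonemes) - n + 1):
            counts[tuple(phonemes[start:start + n])] += occurrences
    return counts


def relative_frequencies(counts):
    """Turn absolute counts into frequencies that sum to one."""
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}


# (phonemic transcription, number of occurrences) for "na", "nie", "do".
SAMPLE = [
    (["n", "a"], 4_235_003),
    (["n'", "e"], 3_601_719),
    (["d", "o"], 2_904_114),
]

unigrams = phoneme_ngram_counts(SAMPLE, 1)
bigrams = phoneme_ngram_counts(SAMPLE, 2)
ranked = sorted(relative_frequencies(unigrams).items(), key=lambda item: -item[1])
print(ranked[:3])
```

Sorting the frequencies in descending order gives the ranked distribution that rank-frequency plots and fitted equations operate on.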
The frequency distribution of words in the orthographic language corpus is presented in Fig. 2.
A sample of the calculated word occurrence frequencies is presented in Table 9, where 1% corresponds to about 2,303,013 occurrences.
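The percentages in Table 9 are consistent with dividing the occurrence count of a word by the corpus size. As a worked example, with C(w_i) denoting the number of occurrences of word w_i and N the corpus size (this notation is assumed here for illustration), the most frequent word "w", which occurs 7,692,997 times according to Table 8, gives:

$$f(w_i) = \frac{C(w_i)}{N}, \qquad N = 230{,}301{,}313$$
$$f(\text{w}) = \frac{7{,}692{,}997}{230{,}301{,}313} \approx 0.0334 = 3.34\%$$

This agrees with the first row of Table 9.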
Samples of the calculated occurrence frequencies of two-word and three-word sequences are presented in Tables 10 and 11. The results for four-word and five-word sequences are not presented in this paper, but they can also be helpful in developing advanced word-based language models.
The frequency distribution of phonemes in the phonemic language corpus is presented in Fig. 3.
The frequency distributions of n-phoneme sequences for n = 2, ..., 5 are presented in Fig. 4.
Evaluation of the obtained results
The results of the statistical analysis of the Polish language, performed with the phonemic language corpus, were compared to other results published in the literature [40–46]. Summary comparisons of the obtained statistical language data with other results available in the literature are presented in the following tables:
• Table 12 presents the occurrence frequency of Polish phonemes and a comparison with the results published in the literature [40, 42, 44, 45],
• Table 13 presents the occurrence frequency of two-phoneme sequences (diphones) in Polish and a comparison with the results published in the literature [45],
• Table 14 presents the occurrence frequency of three-phoneme sequences (triphones) in Polish and a comparison with the results published in the literature [45].
Table 8 A sample section of the developed phonemic language corpus for Polish
No. i | Number of occurrences C(w_i) (of 230,301,313) | Orthographic word w_i | Phonemic transcription of word [SAMPA]
1 | 7,692,997 | w | [f]
2 | 5,333,210 | i | [i]
3 | 4,235,003 | na | [n][a]
4 | 4,158,902 | z | [s]
5 | 3,981,525 | się | [s'][e]
6 | 3,601,719 | nie | [n'][e]
7 | 2,904,114 | do | [d][o]
8 | 2,205,896 | że | [Z][e]
9 | 2,171,877 | to | [t][o]
10 | 1,731,304 | o | [o]
11 | 1,728,527 | jest | [j][e][s][t]
12 | 1,425,793 | a | [a]
13 | 1,003,027 | jak | [j][a][k]
14 | 983,395 | po | [p][o]
15 | 912,660 | od | [o][t]
16 | 877,522 | ale | [a][l][e]
17 | 847,373 | za | [z][a]
18 | 775,006 | przez | [p][S][e][s]
19 | 754,024 | co | [ts][o]
20 | 663,771 | dla | [d][l][a]
21 | 645,573 | czy | [tS][I]
22 | 610,035 | tym | [t][I][m]
23 | 607,673 | już | [j][u][S]
24 | 544,343 | tak | [t][a][k]
25 | 534,509 | tylko | [t][I][l][k][o]
26 | 500,801 | ma | [m][a]
27 | 475,172 | może | [m][o][Z][e]
28 | 451,225 | tego | [t][e][g][o]
29 | 445,705 | ze | [z][e]
30 | 426,201 | jego | [j][e][g][o]
··· | ··· | ··· | ···
Fig. 2 Frequency distribution of the word occurrence in Polish language (log-log plot; horizontal axis: word rank, vertical axis: frequency of occurrence)
The differences between the obtained results and the results of language statistical analyses performed by other scientists may stem from differences in the corpora used (e.g., in size, quality, and linguistic structure) and from the development of the language over time. Language is constantly changing, evolving, and adapting to the needs of its speakers. All languages change continually, and do so in many and varied ways (e.g., lexical changes, phonetic and phonological changes, spelling changes, semantic and syntactic changes) [47]. Therefore, the results of research performed using different corpora may differ considerably [48, 49]. The most similar results concern the statistical analysis of Polish phoneme occurrence presented in Table 12 [44, 45]. The least accurate results were obtained a few decades ago with much smaller language corpora [40–42]. Taking into account the results available in the literature, it can be concluded that the performed statistical analysis of the Polish language was extensive. No results of a statistical analysis of n-phoneme sequence occurrence in Polish for n > 3 were found in the literature. On the basis of the comparison, the following conclusion can be drawn: the developed phonemic language corpus for Polish, which was used to perform the statistical analysis, is very large, containing 1,263,248,497 phonemes, but it is not the largest one developed for the Polish language [44]. The statistical analysis results obtained from it make it possible to develop statistical models of the Polish language.
Table 9 Frequency of the word occurrence in the orthographic corpus file
No. i | Frequency of occurrence f(w_i)·100 [%] | Word w_i
1 | 3.34041 | w
2 | 2.31575 | i
3 | 1.83890 | na
4 | 1.80585 | z
5 | 1.72883 | się
6 | 1.56392 | nie
7 | 1.26101 | do
8 | 0.95783 | że
9 | 0.94306 | to
10 | 0.75176 | o
11 | 0.75055 | jest
12 | 0.61910 | a
13 | 0.43553 | jak
14 | 0.42700 | po
15 | 0.39629 | od
16 | 0.38103 | ale
17 | 0.36794 | za
18 | 0.33652 | przez
19 | 0.32741 | co
20 | 0.28822 | dla
21 | 0.28032 | czy
22 | 0.26489 | tym
23 | 0.26386 | już
24 | 0.23640 | są
25 | 0.23636 | tak
26 | 0.23209 | tylko
27 | 0.21745 | ma
28 | 0.20633 | może
29 | 0.19593 | tego
30 | 0.19353 | ze
··· | ··· | ···
Frequency of the word occurrence
The frequency of word occurrence in a language is well described by Zipf’s law [50, 51]:
$$Z_r = \frac{a}{r^b} \qquad (6)$$
where Z_r is the frequency of the word ranked r, r is the rank of the word when frequencies are ranked from the most frequent (r = 1) to the least frequent (r = n), and a and b are parameters to be estimated from the obtained statistical data. The usual finding is that b is close to 1 [50]. The fit of Zipf’s equation to the ranked frequency distribution of Polish words is presented in Fig. 5.
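The paper does not state how the parameters a and b were estimated. One common approach, shown here purely as an assumption, is ordinary least squares on the logarithms, since Eq. (6) becomes the straight line log Z_r = log a − b·log r. The function name `fit_zipf` and the synthetic test data are hypothetical.

```python
import math


def fit_zipf(frequencies):
    """Fit Z_r = a / r**b by least squares on log Z_r = log a - b * log r.

    `frequencies` must be positive and sorted from rank 1 downwards.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic frequencies that follow a power law exactly (not corpus data).
synthetic = [0.05 / rank ** 0.9 for rank in range(1, 201)]
a_hat, b_hat = fit_zipf(synthetic)
print(a_hat, b_hat)
```

A fit on logarithms gives every rank the same weight on the log scale, so it may yield different parameters than a fit that minimizes errors in the frequencies themselves.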
Table 10 Frequency of the two-word sequence occurrence in the orthographic corpus file
No. | Frequency of occurrence f(w_{i-1}, w_i)·100 [%] | 2-word sequence w_{i-1} w_i
1 | 0.12643 | się w
2 | 0.10243 | w tym
3 | 0.08805 | się na
4 | 0.08287 | się z
5 | 0.07604 | się do
6 | 0.06529 | nie ma
7 | 0.05799 | nie jest
8 | 0.05597 | się, że
9 | 0.04932 | w tej
10 | 0.04741 | jest to
11 | 0.04602 | że w
12 | 0.04595 | że nie
13 | 0.04209 | i w
14 | 0.04209 | nie tylko
15 | 0.04118 | to nie
16 | 0.04083 | i nie
17 | 0.03882 | to jest
18 | 0.03698 | się nie
19 | 0.03145 | to, że
20 | 0.02843 | że to
21 | 0.02751 | ale nie
22 | 0.02725 | przede wszystkim
23 | 0.02718 | w Polsce
24 | 0.02675 | a w
25 | 0.02671 | a nie
26 | 0.02668 | jest w
27 | 0.02640 | po prostu
28 | 0.02629 | w którym
29 | 0.02599 | jak i
30 | 0.02543 | nie było
··· | ··· | ···
The ranked frequency distribution of Polish words was approximated by Zipf’s equation in the following form:
$$Z_r = \frac{0.041566}{r^{0.9}} \qquad (7)$$
The quality of the fit of Zipf’s equation to the ranked frequency distribution of Polish words was measured by the coefficient of determination R^2. For the fit of Zipf’s equation given in Eq. (7), the coefficient of determination is equal to:
$$R^2 = 0.90729 \qquad (8)$$
Table 11 Frequency of the three-word sequence occurrence in the orthographic corpus file
No. | Frequency of occurrence f(w_{i-2}, w_{i-1}, w_i)·100 [%] | 3-word sequence w_{i-2} w_{i-1} w_i
1 | 0.01890 | w związku z
2 | 0.01325 | w tym roku
3 | 0.01238 | ze względu na
4 | 0.01195 | w ten sposób
5 | 0.00921 | na to, że
6 | 0.00920 | okazało się, że
7 | 0.00915 | w tej chwili
8 | 0.00902 | o tym, że
9 | 0.00892 | po raz pierwszy
10 | 0.00891 | w stosunku do
11 | 0.00871 | do tej pory
12 | 0.00791 | w tej sprawie
13 | 0.00786 | jeśli chodzi o
14 | 0.00728 | związku z tym
15 | 0.00665 | to nie jest
16 | 0.00615 | nie jest to
17 | 0.00588 | o których mowa
18 | 0.00582 | których mowa w
19 | 0.00566 | że jest to
20 | 0.00529 | w tym czasie
21 | 0.00522 | w tym samym
22 | 0.00505 | nie może być
23 | 0.00498 | w ogóle nie
24 | 0.00498 | mi się, że
25 | 0.00473 | że nie ma
26 | 0.00472 | nie da się
27 | 0.00442 | w ubiegłym roku
28 | 0.00440 | mam nadzieję, że
29 | 0.00421 | w zależności od
30 | 0.00410 | na tym, że
··· | ··· | ···
Additionally, the root-mean-square error (RMSE) was calculated for this case and is equal to:
$$\mathrm{RMSE} = 7.6475 \cdot 10^{-6} \qquad (9)$$
The R^2 value indicates how well statistical data fit a statistical model. The value R^2 = 0.90729 indicates that Zipf’s equation fits the obtained statistical data on word occurrence frequency in the Polish language well.
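The two goodness-of-fit measures can be computed as in the following minimal sketch. It uses the textbook definitions of R^2 and RMSE; whether the paper evaluated them on the frequencies directly or on their logarithms is not stated, so the choice of plain frequencies here is an assumption. The five observed values are the first rows of Table 9 converted from percent to fractions, so the printed numbers will not reproduce Eqs. (8) and (9), which refer to the full distribution.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def zipf_eq7(rank):
    """Zipf's equation with the parameters reported in Eq. (7)."""
    return 0.041566 / rank ** 0.9


# Frequencies of the five most frequent words (Table 9), as fractions.
OBSERVED = [0.0334041, 0.0231575, 0.0183890, 0.0180585, 0.0172883]
PREDICTED = [zipf_eq7(rank) for rank in range(1, 6)]

print(r_squared(OBSERVED, PREDICTED), rmse(OBSERVED, PREDICTED))
```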
On this basis, and on the basis of the results available in the literature [51–53], it can be concluded that the statistical data obtained from the statistical analysis of the Polish language based on the orthographic language corpus are correct.
Fig. 3 Frequency distribution of the phoneme occurrence in Polish language (log-log plot; horizontal axis: phoneme rank, vertical axis: frequency of occurrence)
Frequency of the phoneme and n-phoneme sequence
occurrence
The frequency of word occurrence in a language is well described by Zipf’s law [50]. However, Zipf’s law does not describe well the distribution of the phonemes and phoneme sequences of which words are composed. An examination of occurrence frequencies in 95 languages, presented in the literature [51], shows that phoneme frequencies are best described by an equation first developed by Yule, which also describes the distribution of DNA codons [54]. The frequency of phoneme occurrence in a language is described well by Yule’s equation [51]:
$$Y_r = \frac{a}{r^b} \cdot c^r \qquad (10)$$
where Y_r is the frequency of the phoneme ranked r, r is the rank of the phoneme when frequencies are ranked from the most frequent (r = 1) to the least frequent (r = n), and a, b, and c are parameters to be estimated from the obtained statistical data.
Fig. 4 Frequency distribution of the n-phoneme sequence occurrence in Polish language, for n = 2, ..., 5 (log-log plot; horizontal axis: n-phoneme sequence rank, vertical axis: frequency of occurrence; series: 2-, 3-, 4-, and 5-phoneme sequences)
The fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes are presented in Fig. 6. The evaluation results of these fits are presented in Table 15.
Note that Zipf’s equation is a special case of Yule’s equation in which the term $c^r$ is neglected (i.e. $c = 1$). It is not always possible to neglect this term. As shown in Fig. 6 and in Table 15, Yule’s equation fits the distribution of phoneme frequencies in Polish much better than Zipf’s equation. This is not an isolated case; a similar regularity can be observed in other languages [51].
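The relationship between the two equations can be illustrated with a short Python sketch. The function names are hypothetical. The parameter values are those reported in Table 15, and the observed values are the five most frequent phonemes from Table 12, converted from percent to fractions. The sketch only evaluates the two curves; it does not reproduce the fitting procedure.

```python
def zipf(rank, a, b):
    """Zipf's equation: Z_r = a / r**b."""
    return a / rank ** b


def yule(rank, a, b, c):
    """Yule's equation (10): Y_r = a / r**b * c**r."""
    return a / rank ** b * c ** rank


# Parameters as reported in Table 15.
A, B_ZIPF = 0.095948, 0.5
B_YULE, C_YULE = 0.149, 0.94

# Observed frequencies of the five most frequent phonemes (Table 12, as fractions).
observed = [0.0959478, 0.0955135, 0.0920001, 0.0465630, 0.0434100]

# Print rank, observed frequency, Zipf prediction and Yule prediction.
for rank, freq in enumerate(observed, start=1):
    print(rank, freq,
          round(zipf(rank, A, B_ZIPF), 5),
          round(yule(rank, A, B_YULE, C_YULE), 5))

# With c = 1 the factor c**r equals 1, so Yule's equation reduces to Zipf's.
assert yule(7, A, B_ZIPF, 1.0) == zipf(7, A, B_ZIPF)
```

The extra factor $c^r$ with $c < 1$ makes the curve fall off faster at high ranks than a pure power law, which is the behaviour visible for the low-frequency phonemes.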
The same regularity was observed for the frequency distributions of n-phoneme sequence occurrence in Polish, for n = 2, ..., 5. Figs. 7 and 8 present the fit of Yule’s equation to the ranked frequency distribution of Polish n-phoneme sequences for n = 2 and n = 3.
A summary of the evaluation results of the fits of Yule’s equation to the ranked frequency distribution of Polish phonemes and n-phoneme sequences for n = 2, ..., 5 is presented in Table 16.
The values of $R^2$ presented in Table 16 indicate that Yule’s equation fits very well to the obtained statistical data on the occurrence frequency of Polish phonemes and n-phoneme sequences for n = 2, ..., 5. Similar properties are observed for other languages. On the basis of the obtained results and the results available in the literature [40, 41, 43–46, 51], it can be concluded that the statistical data obtained from the statistical analysis of Polish, based on the orthographic and phonemic language corpora, are correct.
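The article does not describe how the parameters $a$, $b$ and $c$ were estimated. One simple possibility, shown here purely as an assumption, is ordinary least squares on the logarithm of Eq. (10), $\ln Y_r = \ln a - b \ln r + r \ln c$, which is linear in $(\ln a, b, \ln c)$. The function names below are hypothetical, and a fit in the log domain will generally give different parameters than a nonlinear fit on the raw frequencies.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_yule_loglinear(frequencies):
    """Fit Y_r = a / r**b * c**r to frequencies ordered by rank (rank 1 first).

    Uses least squares on ln Y_r = ln a - b ln r + r ln c and returns (a, b, c).
    All frequencies must be strictly positive.
    """
    rows = [[1.0, -math.log(r), float(r)] for r in range(1, len(frequencies) + 1)]
    targets = [math.log(f) for f in frequencies]
    # Normal equations: (X^T X) theta = X^T y with theta = (ln a, b, ln c).
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(3)]
    ln_a, b, ln_c = solve_linear_system(xtx, xty)
    return math.exp(ln_a), b, math.exp(ln_c)


# Synthetic check: data generated from the n = 1 parameters in Table 16
# (37 ranks, one per phoneme) should be recovered by the fit.
synthetic = [0.095948 / r ** 0.149 * 0.94 ** r for r in range(1, 38)]
print(fit_yule_loglinear(synthetic))
```

On real corpus data the residuals are not zero, so the log-domain fit weights rare sequences more heavily than a fit on the raw frequencies would.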
Example of practical application of the obtained
results for language modelling
This article contains general statistics of the Polish language that can be useful for a variety of language and speech processing applications, including automatic speech recognition with language models [55].
The goal of a word-based language model is to model the sequence of words in the context of the task performed by the speech recognition system. In continuous speech recognition, incorporating a language model is crucial for reducing the search space of the recognized word sequence $W$. The probability $P(W)$ of occurrence of $W$, a sequence of $n$ words $w_i$, can be decomposed as [17]:
Table 12 Frequency of Polish phoneme occurrence: comparison to the results published in the literature [40, 42, 44, 45]
No.  Obtained results  Phoneme [SAMPA]  Results in [44]  Results in [45]  Results in [42]  Results in [40]
i    f(q_i) [%]        [q_i]            f(q_i) [%]       f(q_i) [%]       f(q_i) [%]       f(q_i) [%]
1 9.59478 [e] 7.882 9.108 10.6 10.2
2 9.55135 [a] 8.141 9.584 9.7 9.3
3 9.20001 [o] 7.646 8.994 8.0 9.1
4 4.65630 [t] 3.708 4.489 4.8 4.4
5 4.34100 [n] 3.665 4.443 4.0 4.0
6 4.13142 [I] 3.174 3.648 3.8 4.1
7 4.09810 [j] 3.299 3.796 4.4 4.5
8 4.00623 [i] 3.620 4.359 3.4 3.9
9 3.75265 [r] 3.705 4.674 3.2 3.6
10 3.73464 [s] 2.927 3.638 2.8 3.0
11 3.49063 [v] 3.137 3.782 2.9 3.5
12 3.41265 [p] 2.759 3.263 3.0 3.1
13 3.32576 [u] 2.774 3.345 2.8 3.4
14 3.16465 [m] 2.626 2.988 3.2 3.5
15 2.96802 [k] 2.418 2.976 2.5 2.7
16 2.55419 [n’] 1.840 2.088 2.4 2.6
17 2.29278 [d] 2.391 2.888 2.1 2.2
18 2.22555 [l] 2.164 2.642 1.9 2.1
19 1.93507 [w] 1.626 1.636 1.8 2.2
20 1.70517 [S] 1.118 1.215 1.9 2.0
21 1.69430 [f] 1.363 1.683 1.3 1.5
22 1.60077 [z] 1.665 1.947 1.5 1.8
23 1.46934 [ts] 1.335 1.692 1.2 1.5
24 1.46097 [b] 1.304 1.497 1.5 1.5
25 1.33050 [g] 1.341 1.547 1.3 1.5
26 1.31409 [s’] 0.927 0.965 1.6 1.5
27 1.16326 [ts’] 0.643 0.662 1.2 1.3
28 1.12532 [x] 1.153 1.427 1.0 1.1
29 1.11761 [tS] 0.831 0.955 1.2 1.2
30 1.10377 [Z] 0.884 0.944 1.3 1.2
31 0.79984 [e] 0.582 0.673 0.6 0.7
32 0.65927 [k’] 0.570 0.698 0.7 n.a.
33 0.53682 [dz’] 0.538 0.554 0.7 0.8
34 0.20125 [dz] 0.227 0.261 0.2 0.2
35 0.14815 [z’] 0.183 0.195 0.2 0.2
36 0.10971 [g’] 0.198 0.260 0.1 n.a.
37 0.02412 [dZ] 0.037 0.040 0.1 0.0
$$P(W) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (11)$$
where $P(w_i \mid w_1, \ldots, w_{i-1})$ is the conditional probability that $w_i$ will occur, given the previous word sequence $w_1, \ldots, w_{i-1}$. Unfortunately, it is impossible to compute the conditional word probabilities $P(w_i \mid w_1, \ldots, w_{i-1})$ for all words and all sequence lengths in a given language. Even if the sequences are limited to moderate values of $i$, there would not be enough data to estimate all of the conditional probabilities reliably. The conditional probability can be approximated by estimating the
Table 13 Frequency of the two-phoneme sequence occurrence in Polish: comparison to the results published in the literature [45]
No.  Obtained results       Diphone [SAMPA]   Results in [45]
     f(q_{i-1}, q_i) [%]    [q_{i-1}][q_i]    f(q_{i-1}, q_i) [%]
1 2.09086 [j][e] 1.7253
2 1.48817 [n][a] 1.1632
3 1.40880 [n’][e] 0.8438
4 1.40198 [s][t] 1.0791
5 1.38280 [p][o] 1.0479
6 1.25078 [o][v] 1.1829
7 1.08491 [r][a] 0.9189
8 1.03023 [o][n] 0.8756
9 1.01037 [r][o] 0.9155
10 0.99573 [v][a] 0.8012
11 0.94847 [t][a] 0.8035
12 0.88593 [k][o] 0.7337
13 0.84639 [j][a] 0.6367
14 0.79998 [o][e] 0.506
15 0.79985 [v][j] 0.442
16 0.79298 [d][o] 0.6459
17 0.76749 [e][j] 0.6620
18 0.76340 [a][w] 0.5595
19 0.75699 [t][e] 0.6229
20 0.74761 [t][o] 0.5814
21 0.72197 [z][a] 0.497
22 0.71902 [e][m] 0.60411
23 0.71384 [g][o] 0.515
24 0.69492 [e][n] 0.6768
25 0.68893 [S][e] 0.456
26 0.68631 [k][a] 0.540
27 0.67323 [n][e] 0.60803
28 0.67050 [v][I] 0.526
29 0.66791 [l][i] 0.58227
··· ··· ··· ···
probability only on the preceding $N-1$ words, as defined by the following formula:
$$P(W) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) \qquad (12)$$
This approximation is commonly referred to as the N-gram model [17]. The most popular solutions published in the literature relate to the application of N-gram language models to word-based speech recognition tasks [56–59].
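The N-gram approximation can be illustrated with a minimal Python sketch that estimates the conditional probabilities as relative frequencies of counted word sequences. This is not the author's implementation. The function names are hypothetical, no smoothing is applied, and a padding symbol `<s>` is assumed at the start of each sequence so that the first word is handled by the same formula as the others. The toy corpus reuses phrases from the article's table of frequent three-word sequences, with invented counts.

```python
from collections import Counter


def train_ngram_counts(sentences, n):
    """Count N-grams and their (N-1)-word contexts over tokenised sentences."""
    ngrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def conditional_probability(word, context, ngrams, contexts):
    """Relative-frequency estimate of P(word | context); 0.0 for unseen contexts."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0
    return ngrams[context + (word,)] / contexts[context]


def sequence_probability(words, n, ngrams, contexts):
    """Product of the conditional probabilities of all words in the sequence."""
    padded = ["<s>"] * (n - 1) + list(words)
    prob = 1.0
    for i in range(n - 1, len(padded)):
        prob *= conditional_probability(padded[i], padded[i - n + 1:i], ngrams, contexts)
    return prob


# Toy corpus (illustrative counts only).
corpus = [
    ["w", "tym", "samym"],
    ["nie", "da", "się"],
    ["nie", "może", "być"],
    ["w", "ogóle", "nie"],
]
bigrams, bigram_contexts = train_ngram_counts(corpus, 2)
print(sequence_probability(["nie", "da", "się"], 2, bigrams, bigram_contexts))
```

Without smoothing, any sequence containing an unseen N-gram receives probability zero, which is why practical models add a smoothing or back-off scheme.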
Table 14 Frequency of the three-phoneme sequence occurrence in Polish: comparison to the results published in the literature [45]
No.  Obtained results                Triphone [SAMPA]         Results in [45]
     f(q_{i-2}, q_{i-1}, q_i) [%]    [q_{i-2}][q_{i-1}][q_i]  f(q_{i-2}, q_{i-1}, q_i) [%]
1 0.67353 [v][j][e] 0.3159
2 0.62188 [e][g][o] 0.3655
3 0.56670 [o][v][a] 0.3801
4 0.52262 [s][t][a] 0.3287
5 0.51677 [p][S][e] 0.2969
6 0.36109 [m][j][e] 0.2503
7 0.34557 [e][s][t] 0.1734
8 0.33317 [o][n][ts] 0.1749
9 0.32041 [p][r][a] 0.1681
10 0.31920 [o][s’][ts’] 0.1533
11 0.31277 [j][o][n] 0.189
12 0.28832 [j][o][e] 0.143
13 0.27851 [p][S][I] 0.118
14 0.27638 [k][t][u] 0.1311
15 0.27428 [p][r][o] 0.1807
16 0.26887 [t][u][r] 0.1448
17 0.25608 [n][I][x] 0.1673
18 0.25453 [o][v][j] 0.1404
19 0.25330 [j][e][s] n.a.
20 0.25049 [o][s][t] 0.1785
21 0.24914 [p][j][e] 0.120
22 0.24394 [e][n][t] 0.1842
23 0.23760 [a][j][o] 0.126
24 0.23674 [a][l][e] 0.116
25 0.22874 [s’][ts’][i] 0.130
26 0.22739 [p][o][v] 0.122
27 0.22454 [a][n’][e] 0.1586
28 0.21298 [s][p][o] 0.1627
29 0.20990 [o][v][e] 0.1712
··· ··· ··· ···
Language modelling may be based on modelling of words as well as sub-words (e.g. phonemes). Statistical analysis of the phonemic corpus makes it possible to develop statistical language models based on phonemes.
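The kind of statistic that underlies such models, the relative frequency of n-phoneme sequences, can be sketched as follows. The function names are hypothetical and the SAMPA transcriptions are illustrative examples rather than corpus data. The sketch also assumes that sequences are counted within words only, which the article does not specify.

```python
from collections import Counter


def nphoneme_frequencies(transcriptions, n):
    """Relative frequencies of n-phoneme sequences, counted within each word."""
    counts = Counter()
    for phonemes in transcriptions:
        for start in range(len(phonemes) - n + 1):
            counts[tuple(phonemes[start:start + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: count / total for seq, count in counts.items()}


def ranked(frequencies):
    """Sort sequences from most to least frequent (rank 1 first)."""
    return sorted(frequencies.items(), key=lambda item: item[1], reverse=True)


# Illustrative SAMPA-style transcriptions (assumed, not taken from the corpus).
words = [["j", "e", "s", "t"], ["n'", "e"], ["j", "e", "s", "t"]]

for sequence, frequency in ranked(nphoneme_frequencies(words, 2)):
    print("".join("[%s]" % p for p in sequence), round(frequency, 4))
```

The ranked list produced this way has the same form as Tables 13 and 14, and it is the input to which a rank-frequency equation can be fitted.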
For a sequence of phonemes $Q = q_1 \ldots q_m$, containing $m$ phonemes $q_i$, the probability $P(Q)$ is given by a phoneme-based language model and the following formula:
$$P(Q) = P(q_1) \prod_{i=2}^{m} P(q_i \mid q_1, \ldots, q_{i-1}) \qquad (13)$$
Fig. 5 Fit of Zipf’s equation to the ranked frequency distribution of Polish words (log-log plot; x-axis: word rank, y-axis: frequency of occurrence; series: words, Zipf distribution; plot annotation: R2 = 0.90729; RMSE = 7.6475e−006; a = 0.041566; b = 0.9)
where $P(q_i \mid q_1, \ldots, q_{i-1})$ is the conditional probability that $q_i$ will occur, given the previous phoneme sequence $q_1, \ldots, q_{i-1}$. The approximation of the probability $P(Q)$ for an N-gram phoneme-based language model is defined by the analogous formula:
$$P(Q) = P(q_1) \prod_{i=2}^{m} P(q_i \mid q_{i-N+1}, \ldots, q_{i-1}) \qquad (14)$$
On the basis of the statistical analysis of the orthographic language corpus, N-gram word-based language models for N = 1, ..., 3 were developed for Polish. In a similar way, on the basis of the statistical analysis results of the phonemic language corpus, the N-gram phoneme-based language models for N = 1, ..., 3, intended for Polish,
Fig. 6 Fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes (log-log plot; x-axis: phoneme rank, y-axis: frequency of occurrence; series: phonemes, Zipf distribution, Yule distribution; plot annotation: R2 = 0.92198; RMSE = 0.0067012; a = 0.095948; b = 0.149; c = 0.94)
Table 15 Evaluation results of the fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes
No.  Equation                                               $R^2$    RMSE
1    $Z_r = \frac{0.095948}{r^{0.5}}$                       0.81180  $1.0408 \cdot 10^{-2}$
2    $Y_r = \frac{0.095948}{r^{0.149}} \cdot 0.94^r$        0.92198  $6.7012 \cdot 10^{-3}$
were developed. The details of the development process of the word-based and phoneme-based language models are presented in a separate publication. This article presents only an example of applying the statistical analysis of the language to develop selected language models.
One approach to evaluating a language model is the word recognition error rate [60]. However, this approach requires a working speech recognition system. Alternatively, we can measure the average number of possible words that follow any given word sequence in a language. This measure, derived from entropy, is known as perplexity (PP) [17]. Given a language model $P(W)$, where $W$ is the n-word sequence, the entropy of the language model can be defined as [61]:
$$H(W) = -\frac{1}{n} \log_2 P(W) \qquad (15)$$
For an N-gram language model, the entropy $H(W)$ can be calculated with the following formula:
$$H(W) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) \qquad (16)$$
Note that as $n$ approaches infinity, the entropy approaches the asymptotic entropy of the source defined by the measure $P(W)$. This means that the typical length of the sequence must approach infinity, which is of course impossible. Thus, the entropy $H(W)$ should be estimated on
Fig. 7 Fit of Yule’s equation to the ranked frequency distribution of Polish two-phoneme sequences (log-log plot; x-axis: 2-phoneme sequence rank, y-axis: frequency of occurrence; series: 2-phoneme sequence, Yule distribution; plot annotation: R2 = 0.99532; RMSE = 0.00013393; a = 0.020909; b = 0.287; c = 0.994)
Fig. 8 Fit of Yule’s equation to the ranked frequency distribution of Polish three-phoneme sequences (log-log plot; x-axis: 3-phoneme sequence rank, y-axis: frequency of occurrence; series: 3-phoneme sequence, Yule distribution; plot annotation: R2 = 0.98932; RMSE = 2.2056e−005; a = 0.0067353; b = 0.35; c = 0.9992)
a sufficiently large value of $n$. The perplexity $PP(W)$ of the word-based language model is then defined as [17]:
$$PP(W) = 2^{H(W)} \qquad (17)$$
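Equations (16) and (17) translate directly into code. The sketch below is a minimal illustration with hypothetical function names. It takes as input the list of conditional probabilities that a language model assigns to each unit of a test sequence, and it assumes all of them are strictly positive.

```python
import math


def entropy_bits(conditional_probs):
    """Eq. (16): average negative log2 probability per unit of the sequence."""
    n = len(conditional_probs)
    return -sum(math.log2(p) for p in conditional_probs) / n


def perplexity(conditional_probs):
    """Eq. (17): perplexity is 2 raised to the entropy."""
    return 2.0 ** entropy_bits(conditional_probs)


# A model that always assigns probability 1/4 behaves like a uniform
# choice among four alternatives.
print(perplexity([0.25] * 8))

# A uniform model over 37 phonemes (the inventory size in Table 12).
print(perplexity([1.0 / 37] * 100))
```

A uniform model over an inventory of $k$ units has perplexity $k$, so the second value gives a reference point for a 37-phoneme inventory against which phoneme-based perplexities can be read.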
A comparison of the perplexity values $PP_N(W)$ for the developed word-based N-gram language models for N = 1, ..., 3 is presented in Table 17. A comparison of the perplexity values $PP_N(Q)$ for the developed phoneme-based N-gram language models for N = 1, ..., 3 is presented in Table 18.
The PP values presented in Tables 17 and 18 show that the developed phoneme-based 3-gram language model has the lowest PP value, equal to 7.77. A lower perplexity value indicates a greater ability of the language model to predict the sequence of speech components, so a language model is rated as better if its perplexity is lower. Language models with low perplexity indicate a more predictable language. However, since perplexity is not related to the complexity of recognizing acoustic patterns, reducing the language model perplexity does not guarantee an improvement in automatic speech recognition performance.
Table 16 Evaluation results of the fit of Yule’s equation to the ranked frequency distribution of Polish phonemes (n = 1) and n-phoneme sequences for n = 2, ..., 5
n  Yule’s equation                                        $R^2$    RMSE
1  $Y_r = \frac{0.095948}{r^{0.149}} \cdot 0.94^r$        0.92198  $6.7012 \cdot 10^{-3}$
2  $Y_r = \frac{0.020909}{r^{0.287}} \cdot 0.994^r$       0.99532  $1.3393 \cdot 10^{-4}$
3  $Y_r = \frac{0.0067353}{r^{0.35}} \cdot 0.9992^r$      0.98932  $2.2056 \cdot 10^{-5}$
4  $Y_r = \frac{0.0031449}{r^{0.35}} \cdot 0.9997^r$      0.98510  $5.5017 \cdot 10^{-6}$
5  $Y_r = \frac{0.0019385}{r^{0.33}} \cdot 0.9998^r$      0.98028  $2.6493 \cdot 10^{-6}$
Table 17 Comparison of perplexity $PP_N(W)$ values for the developed word-based N-gram language model for N = 1, ..., 3
N  Perplexity  Value   Word-based language model
1  $PP_1(W)$   9317.1  1-gram word-based language model
2  $PP_2(W)$   933.0   2-gram word-based language model
3  $PP_3(W)$   278.9   3-gram word-based language model
Potential application of other statistical analysis results
The statistical analysis results for the occurrence of 4-word and 5-word sequences are not presented in this paper. However, these results can be helpful in developing advanced (4-gram and 5-gram) word-based language models for Polish. As noted previously, language modelling may be based on modelling of words as well as sub-words (e.g. phonemes). Therefore, the statistics of sequences longer than three phonemes can be used for developing advanced (higher than 3-gram) phoneme-based language models for Polish. Advanced word-based and phoneme-based language modelling makes it possible to develop hybrid language models for out-of-vocabulary (OOV) word detection in large vocabulary conversational speech recognition (LVCSR) systems for the language [62, 63]. The language model in most state-of-the-art LVCSR systems is still the N-gram, which assigns a probability to the next word based on only the $N-1$ preceding words [64]. However, the use of additional phoneme-based language models improves the efficiency of LVCSR systems [65]. Another improvement in LVCSR system development is the use of higher than 4-gram language models, with particular emphasis on N-gram phoneme-based language models.
Conclusions
This paper presents the original results of a statistical analysis of the Polish language, performed using the orthographic language text corpus obtained from the NCP and the phonemic language corpus developed through automatic grapheme-to-phoneme conversion of the orthographic language corpus. The results of the statistical analysis make it possible to develop statistical word-based and phoneme-based language models for use in automatic speech recognition.
The results of the research on the statistical analysis of Polish were compared with, and are consistent with, other results available in the literature [40–46, 66, 67].
Table 18 Comparison of perplexity $PP_N(Q)$ values for the developed phoneme-based N-gram language model for N = 1, ..., 3
N  Perplexity  Value  Phoneme-based language model
1  $PP_1(Q)$   22.08  1-gram phoneme-based language model
2  $PP_2(Q)$   10.85  2-gram phoneme-based language model
3  $PP_3(Q)$   7.77   3-gram phoneme-based language model
Taking into account the results available in the literature, it can be concluded that the performed statistical analysis of the language was extensive. No results of the statistical analysis of n-phoneme sequence occurrence in Polish for n > 3 were found in the literature. On the basis of the comparison results, the following conclusion can be drawn: the phonemic language corpus for Polish used to perform the statistical analysis was very large (containing 1,263,248,497 phonemes), and the statistical analysis results based on it make it possible to develop statistical models of the Polish language.
Additionally, the obtained statistical data were validated and evaluated. The frequency of word occurrence in a language is well described by Zipf’s law. The statistical data for words were validated by fitting Zipf’s equation to the ranked frequency distribution of Polish words. A similar regularity was observed for the frequency distribution of phoneme occurrence in Polish. The examination of occurrence frequency in 95 languages presented in the literature [51] shows that phoneme frequencies are best described by Yule’s equation [54]. The statistical data for phonemes were validated by fitting Yule’s equation to the ranked frequency distributions of Polish phonemes and n-phoneme sequences. In line with the results available in the literature [51], it can be concluded that the statistical data obtained from the statistical analysis of Polish, based on the orthographic and phonemic language corpora, are correct.
The regularity presented in this paper is not an isolated case; a similar regularity can be observed in other languages, and therefore also for other language corpora reflecting the state of the contemporary language [51]. It should also be noted that it would be valuable to provide similar fits for existing Polish text corpora, to allow the reader to assess the quality of the created phonemic language corpus. Similarly, it would be very valuable to compare the word error rate and the perplexity of language models created from the existing Polish corpora with respect to a common test set. However, this is difficult to do because of the lack of access to other existing Polish text corpora of appropriate size and quality, apart from the NCP corpus. Similarly, the author did not find any available phonemic language corpus for Polish. Therefore, the author created his own phonemic language corpus using G2P conversion of the existing available orthographic language corpus for Polish (NCP). Since this problem seems to be very important, the author plans to address it in future publications.
The developed word-based and phoneme-based language models were also presented in this paper as an example of a practical application of the obtained statistical data on the Polish language. The obtained statistical data open up further opportunities to continue research on improving automatic speech recognition in Polish. The plan for future research includes the development of statistical word-based and subword-based language models for Polish. Word-based and subword-based language modelling makes it possible to develop hybrid language models for out-of-vocabulary word detection in large vocabulary conversational speech recognition [64, 68–70].
Acknowledgements
This work was supported by the Polish Ministry of Science and Higher
Education funding for statutory activities.
Competing interests
The authors declare that they have no competing interests.
Received: 15 April 2016 Accepted: 8 February 2017
References
1. L Rabiner, B Juang, Fundamentals of Speech Recognition. Prentice Hall signal
processing series. (PTR Prentice Hall, USA, 1993)
2. JR Bellegarda, C Monz, State of the art in statistical methods for language
and speech processing. Comput. Speech Lang. 35, 163–184 (2016)
3. L Rabiner, B Juang, Encyclopedia of Language and Linguistics, Statistical
methods for the recognition and understanding of speech. (Elsevier,
Amsterdam, 2005)
4. S Sakti, K Markov, S Nakamura, W Minker, in Incorporating Knowledge
Sources into Statistical Speech Recognition, vol 42 of Lecture Notes in
Electrical Engineering. Statistical Speech Recognition (Springer US, USA,
2009), pp. 19–53
5. J Bellegarda, Large vocabulary speech recognition with multispan
statistical language models. IEEE Trans. Speech Audio Process. 8,
76–84 (2000)
6. P Kłosowski, in Computer Networks vol 79 of Communications in Computer
and Information Science, ed. by A Kwiecien, P Gaj, and P Stera. Speech
processing application based on phonetics and phonology of the Polish
language. 17th International Conference Computer Networks, Ustron,
Poland, Jun 15-19 (Springer-Verlag, Berlin, 2010), pp. 236–244
7. P Kłosowski, Improving speech processing based on phonetics and
phonology of Polish language. Przegląd Elektrotechniczny. 89,
303–307 (2013)
8. J Izydorczyk, P Kłosowski, Acoustic properties of Polish vowels. Bull. Pol.
Acad. Sci. Tech. Sci. 47(1), 29–37 (1999)
9. J Izydorczyk, P Kłosowski, in International Conference Programable Devices
and Systems PDS2001 IFAC Workshop, Gliwice November 22nd - 23rd. Base
acoustic properties of Polish speech (IFAC, Gliwice, 2001), pp. 61–66
10. P Kłosowski, A Dustor, J Izydorczyk, J Kotas, J Slimok, in Computer
Networks, CN 2014. vol 431 of Communications in Computer and Information
Science, ed. by A Kwiecien, P Gaj, and P Stera. Speech recognition based
on open source speech processing software. 21st International Science
Conference on Computer Networks (CN), Brunow, Poland, Jun 23-27
(Springer-Verlag, Berlin, 2014), pp. 308–317
11. A Dustor, P Kłosowski, in Computer Networks, CN 2013. vol 370 of
Communications in Computer and Information Science, ed. by A Kwiecien,
P Gaj, and P Stera. Biometric voice identification based on Fuzzy Kernel
Classifier. 20th International Conference on Computer Networks (CN),
Lwowek Slaski, Poland, Jun 17-21 (Springer-Verlag, Berlin, 2013),
pp. 456–465
12. A Dustor, P Kłosowski, J Izydorczyk, in 2014 International Conference on
Multimedia Computing and Systems (ICMCS). Speaker recognition system
with good generalization properties. International Conference on
Multimedia Computing and Systems (ICMCS), Marrakech, Morocco, Apr
14-16 (IEEE, USA, 2014), pp. 206–210
13. A Dustor, P Kłosowski, J Izydorczyk, in Computer Networks, CN 2014. vol 431
of Communications in Computer and Information Science, ed. by A
Kwiecien, P Gaj, and P Stera. Influence of Feature Dimensionality and
Model Complexity on Speaker Verification Performance. 21st
International Science Conference on Computer Networks (CN), Brunow,
Poland, Jun 23-27 (Springer-Verlag, Berlin, 2014), pp. 177–186
14. P Kłosowski, A Dustor, J Izydorczyk, in Computer Networks, CN 2015. vol 522
of Communications in Computer and Information Science, ed. by P Gaj, A
Kwiecien, and P Stera. Speaker verification performance evaluation based
on open source speech processing software and timit speech corpus.
22nd International Conference on Computer Networks (CN), Brunow,
Poland, Jun 16-19 (Springer-Verlag, Berlin, 2015), pp. 400–409
15. A Dustor, P Kłosowski, J Izydorczyk, R Kopanski, in Computer Networks, CN
2015. vol 522 of Communications in Computer and Information Science, ed.
by P Gaj, A Kwiecien, and P Stera. Influence of Corpus Size on Speaker
Verification. 22nd International Conference on Computer Networks (CN),
Brunow, Poland (Springer-Verlag, Berlin, 2015), pp. 242–249
16. P Kłosowski, A Dustor, in Computer Networks, CN 2013. vol 370 of
Communications in Computer and Information Science, ed. by A Kwiecien,
P Gaj, and P Stera. Automatic Speech Segmentation for Automatic
Speech Translation. 20th International Conference on Computer
Networks (CN), Lwowek Slaski, Poland, Jun 17-21 (Springer-Verlag, Berlin,
2013), pp. 466–475
17. F Jelinek, Statistical Methods for Speech Recognition. Language, Speech, &
Communication: A Bradford Book. (MIT Press, USA, 1997)
18. S Furui, Recent progress in corpus-based spontaneous speech
recognition. IEICE Trans. Inf. Syst. E88D, 366–375 (2005)
19. M Adda-Decker, Corpus for automatic speech recognition. Revue
Francaise De Linguistique Appliquee. 12, 71–84 (2007)
20. A Przepiórkowski, M Bańko, RL Górski, B Lewandowska-Tomaszczyk, The
National Corpus of Polish (in Polish: Narodowy Korpus Języka Polskiego).
(Wydawnictwo Naukowe PWN, Warszawa, 2012)
21. A Przepiórkowski, RL Górski, B Lewandowska-Tomaszczyk, M Łaziński, in
Proceedings of the Sixth International Conference on Language Resources
and Evaluation, LREC 2008. Towards the national corpus of Polish
(Marrakech, ELRA, 2008)
22. RL Górski, B Lewandowska-Tomaszczyk, M Bańko, P Pęzik, M Łaziński, A
Przepiórkowski, Practical applications of the National Corpus of Polish.
Prace Filologiczne. 63, 231–240 (2012)
23. J Hirschberg, CD Manning, Advances in natural language processing.
Science. 349, 261–266 (2015)
24. International Phonetic Association, Handbook of the International Phonetic
Association: A Guide to the Use of the International Phonetic Alphabet. A
Regents publication. (Cambridge University Press, UK, 1999)
25. R Sussex, P Cubberley, The Slavic Languages. Cambridge Language Surveys.
(Cambridge University Press, UK, 2006)
26. J Wells, in Handbook of Standards and Resources for Spoken Language
Systems. vol Part IV, section B, ed. by D Gibbon, R Moore, and R Winski.
SAMPA computer readable phonetic alphabet (Mouton de Gruyter, Berlin
and New York, 1997)
27. M Razavi, R Rasipuram, MM Doss, Acoustic data-driven
grapheme-to-phoneme conversion in the probabilistic lexical modeling
framework. Speech Commun. 80, 1–21 (2016)
28. RM Kaplan, M Kay, Regular models of phonological rule systems. Comput.
Linguist. 20, 331–378 (1994)
29. M Steffen-Batóg, The problem of automatic phonemic transcription of
written Polish. Biuletyn Fonograficzny. 14, 75–86 (1973)
30. M Steffen-Batóg, in Polish: Automatyzacja transkrypcji fonematycznej
tekstów polskich. Automatic phonemic transcription of Polish texts
(Wydawnictwo Naukowe PWN, Warszawa, 1975)
31. M Steffen-Batóg, P Nowakowski, in Studia Phonetica Posnaniensia. Vol. 3,
ed. by M Steffen-Batóg, W Awedyk. An algorithm for phonetic
transcription of orthographic texts in Polish (Wydawnictwo Naukowe
UAM, Poznań, 1993)
32. W Jassem, A phonemic transcription and syllable division rule engine.
(Onomastica-Copernicus Research Colloquium, Edinburgh, 1996)
33. P Kłosowski, in Proceedings of 20th IEEE International Conference Signal
Processing Algorithms, Architectures, Arrangements, and Applications,
September 21-23. Algorithm and implementation of automatic phonemic
transcription for Polish (Poznan University of Technology, Poznań, 2016),
pp. 298–303
34. M Wypych, in Speech and Language Technology. Vol. 3. Implementation
of phonemic transcription algorithm (in Polish: Implementacja
algorytmu transkrypcji fonematycznej) (Polskie Towarzystwo Fonetyczne,
Poznań, 1999)
35. G Demenko, M Wypych, E Baranowska, Implementation of
grapheme-to-phoneme rules and extended SAMPA alphabet in Polish
text-to-speech synthesis. Speech Lang. Technol. 7(17) (2003)
36. P Przybysz, W Kasprzak, in 2013 6th International Conference on Human
Systems Interactions (HSI), ed. by WA Paja, BM Wilamowski. The generation
of letter-to-sound rules for grapheme-to-phoneme conversion.
Conference on Human System Interaction. Gdansk Univ Technol; Univ
Informat Technol & Management; IEEE Ind Elect Soc (Gdansk University of
Technology, Gdansk, 2013), pp. 292–297
37. D Skurzok, B Ziółko, M Ziółko, in 7th Language & Technology Conference,
Poznań. Ortfon2 - tool for orthographic to phonetic transcription (Adam
Mickiewicz University in Poznan, Poznan, 2015)
38. D Koržinek, Ł Brocki, K Marasek, Polish grapheme-to-phoneme tool and
service, CLARIN-PL digital repository (2016). http://hdl.handle.net/11321/
295, (Online: 2016.08.01)
39. Wiktionary, Polish Language Dictionary (2015). https://pl.wiktionary.org/.
Accessed 17 Feb 2017
40. W Jassem, Podstawy fonetyki akustycznej (eng. Rudiments of acoustic
phonetics). (PWN, Warszawa, 1973)
41. P Łobacz, W Jassem, Fonotaktyczna analiza mówionego tekstu polskiego
(eng. Phonotactic analysis of spoken Polish texts). Biuletyn Polskiego
Towarzystwa Ję. 32, 179–195 (1974)
42. C Basztura, Rozmawiac z komputerem (Eng. To speak with computers), (1992)
43. B Ziółko, J Gałka, S Manandhar, RC Wilson, M Ziółko, in Human Language
Technology: Challenges of the Information Society. Vol 5603 of Lecture Notes
in Artificial Intelligence, ed. by Z Vetulani, H Uszkoreit. Triphone Statistics
for Polish Language. 3rd Language and Technology Conference 2007,
Poznan, Poland, Oct 05-07, (2009), pp. 63–73
44. B Ziółko, J Gałka, M Ziółko, Polish phoneme statistics obtained on large
set of written texts. Comput. Sci. (AGH). 10, 97–106 (2009)
45. B Ziółko, J Gałka, in Computer Science and Information Technology (IMCSIT),
Proceedings of the 2010 International Multiconference on. Polish phones
statistics (AGH University of Science and Technology, Krakow, 2010),
pp. 561–565
46. B Ziółko, P Zelasko, D Skurzok, in 2014 XXII Annual Pacific Voice Conference
(PVC). Statistics of diphones and triphones presence on the word
boundaries in the Polish language. Applications to ASR. Annual Pacific
Voice Conference, AGH; Pacific Voice Speech Fdn, 2014. 22nd Annual
Pacific Voice Conference (PVC) (Krakow, AGH University of Science and
Technology, 2014)
47. D Lightfoot, The development of language: Acquisition, change, and
evolution. (Wiley-Blackwell, Hoboken, 1999)
48. D Biber, S Conrad, R Reppen, Corpus linguistics: Investigating language
structure and use. (Cambridge University Press, Cambridge, 1998)
49. R Facchinetti, M Rissanen, Corpus-based studies of diachronic English, vol.
31. (Peter Lang, 2006)
50. GK Zipf, Human behavior and the principle of least effort. J. Clin. Psychol.
6(3), 306–306 (1950)
51. Y Tambovtsev, C Martindale, Phoneme frequencies follow a yule
distribution. SKASE J. Theor. Linguist. 4(2) (2008)
52. ST Piantadosi, Zipf’s word frequency law in natural language: A critical
review and future directions. Psychonomic Bull. Rev. 21, 1112–1130 (2014)
53. A Corral, G Boleda, R Ferrer-i Cancho, Zipf’s law for word frequencies:
word forms versus lemmas in long texts. Plos ONE. 10(7), e0129031
(2015). doi:10.1371/journal.pone.0129031
54. GU Yule, A mathematical theory of evolution, based on the conclusions
of Dr. J. C. Willis, F.R.S. Phil. Trans. R. Soc. London B Biol Sci. 213(402-410),
21–87 (1925)
55. S Dziadzio, A Nabożny, A Smywiński-Pohl, B Ziółko, in Computer Science
and Information Systems (FedCSIS) 2015 Federated Conference on.
Comparison of language models trained on written texts and speech
transcripts in the context of automatic speech recognition (Lodz
University of Technology, Lodz, 2015), pp. 193–197
56. S Takahashi, T Morimoto, in 2012 International Conference on Asian
Language Processing (IALP 2012), ed. by D Xiong, E Castelli, M Dong, and
PTN Yen. N-gram Language Model Based on Multi-Word Expressions in
Web Documents for Speech Recognition and Closed-Captioning
(Soochow University, China, 2012), pp. 225–228
57. A Hatami, A Akbari, B Nasersharif, in 2013 21st Iranian Conference on
Electrical Engineering (ICEE). N-gram Adaptation Using Dirichlet Class
Language Model Based on Part-of-Speech for Speech Recognition
(Ferdowsi University of Mashhad, Mashhad, 2013)
58. M Bahrani, H Sameti, N Hafezi, S Momtazi, in New Frontiers in Applied
Artificial Intelligence, vol 5027 of Lecture Notes in Artificial Intelligence, ed. by
NT Nguyen, L Borzemski, A Grzech, and M Ali. New word clustering
method for building n-gram language models in continuous speech
recognition systems (Springer, Berlin, 2008), pp. 286–293
59. B Rapp, in 2008 International Multiconference on Computer Science and
Information Technology (IMCSIT), Vols 1 and 2, ed. by M Ganzha, M
Paprzycki, and T PelechPilichowski. N-gram language models for Polish
language. Basic concepts and applications in automatic speech
recognition systems (IEEE Computer Society Press, Los Alamitos, 2008),
pp. 295–298
60. D Klakow, J Peters, Testing the correlation of word error rate and
perplexity. Speech Commun. 38(1–2), 19–28 (2002)
61. T Cover, J Thomas, Wiley series in telecommunications: Elements of
information theory. (John Wiley and Sons, USA, 1991)
62. P Yu, FTB Seide, in Interspeech. A hybrid word/phoneme-based approach
for improved vocabulary-independent search in spontaneous speech
(Citeseer, Jeju Island, 2004)
63. V Chunwijitra, A Chotimongkol, C Wutiwiwatchai, A hybrid input-type
recurrent neural network for lvcsr language modeling. EURASIP J. Audio
Speech Music Process. 2016(1), 15 (2016)
64. A Yazgan, M Saraclar, in Acoustics, Speech, and Signal Processing, 2004.
Proceedings.(ICASSP’04). IEEE International Conference on. Hybrid language
models for out of vocabulary word detection in large vocabulary
conversational speech recognition. vol 1 (IEEE, 2004), pp. I–745
65. M Larson, Sub-word-based language models for speech recognition:
implications for spoken document retrieval. Workshop on Language
Modeling and Information Retrieval (2001)
66. A Czardybon, O Hellwig, W Petersen, in Advances in Natural Language
Processing. vol 8686 of Lecture Notes in Artificial Intelligence, ed. by A
Przepiorkowski, M Ogrodniczuk. Statistical Analysis of the Interaction
between Word Order and Definiteness in Polish. Polish Acad Sci, Inst
Comp Sci, 2014. 9th International Conference on Natural Language
Processing (NLP), Warsaw, Poland, Sep 17-19 (Polish Academy of Science,
Institute of Computer Science, Warsaw, 2014), pp. 144–150
67. P Mandera, E Keuleers, Z Wodniecka, M Brysbaert, Subtlex-pl:
subtitle-based word frequency estimates for Polish. Behav. Res. Methods.
47, 471–483 (2015)
68. JR Bellegarda, Large vocabulary speech recognition with multispan
statistical language models. IEEE Trans. Speech Audio Process. 8,
76–84 (2000)
69. H Schwenk, Continuous space language models. Comput. Speech Lang.
21(3), 492–518 (2007)
70. MAB Shaik, AE-D Mousa, R Schlüter, H Ney, in INTERSPEECH. Hybrid
language models using mixed types of sub-lexical units for open
vocabulary German LVCSR (International Speech Communication
Association (ISCA), Baixas, 2011), pp. 1441–1444
... In addition, research studies have been conducted on the phonetic properties of Polish phonemes [7,8], speech recognition based on such analyses [9], speaker recognition [10][11][12][13][14] and new applications of speech and language processing (e.g., speech translation) [15]. Particularly good results in speech recognition are achieved through the use of statistical language models [16][17][18][19]. However, currently the field of language modelling is shifting from statistical methods to neural networks and deep learning methods [20,21]. ...
... However, currently the field of language modelling is shifting from statistical methods to neural networks and deep learning methods [20,21]. The development of statistical and deep learning methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora [18][19][20][21][22][23][24][25]. The main motivation for undertaking this research on automatic grapheme-to-phoneme conversion and its application was the development of effective methods of creating a phonemic language corpus for Polish, comprised of phonemic transcriptions derived from an orthographic language corpus through grapheme-to-phoneme conversion. ...
... Automatic grapheme-to-phoneme conversion allows creating large phonemic language corpora from orthographic language corpora. A phonemic language corpus for Polish was developed by the author using automatic grapheme-to-phoneme conversion of an orthographic language corpus, in order to be able to perform statistical phonological analysis of the Polish language, and to develop phoneme-based statistical language models for Polish to improve automatic speech recognition [17][18][19]. ...
Article
This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. It should be noted that the fundamental grapheme-to-phoneme conversion rules have been developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of texts in Polish. The author used previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to corresponding strings of phonemes, in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created out of an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic language corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora. The method presented here enables the creation of such corpora.
... Investigating the frequency of phonemes, and of sequences of two or three phonemes (especially across word boundaries compared to word-internally), has been proposed in past research mainly to improve speech recognition system with statistical language modelling (Jassem, 1973;Basztura, 1992;Ziółko et al., 2009;Ziółko and Gałka, 2010;Kłosowski, 2017). However, such explorations are also useful to theorists investigating language variation in synchrony and language evolution through the lens of frequency-based exemplar models (Bybee, 2002). ...
... In terms of consonants, we also see that the consonants that have a low ratio in previous studies 5 Other studies, such as (Ziółko et al., 2014), have explored the frequency of diphones and triphones in oral corpora, but not that of single phonemes. Table 3: Rates (%) of each phoneme in 5 past corpora (Jassem, 1973;Basztura, 1992;Ziółko et al., 2009;Ziółko and Gałka, 2010;Kłosowski, 2017) and in Lingua Libre (abbreviated as LiLi). The frequencies from Lingua Libre are extracted based on all the recorded words available in Lingua Libre. ...
Conference Paper
Oral corpora for linguistic inquiry are frequently built based on the content of news, radio, and/or TV shows, sometimes also of laboratory recordings. Most of these existing corpora are restricted to languages with a large amount of data available. Furthermore, such corpora are not always accessible under a free open-access license. We propose a crowd-sourced alternative to this gap. Lingua Libre is the participatory linguistic media library hosted by Wikimedia France. It includes recordings from more than 140 languages. These recordings have been provided by more than 750 speakers worldwide, who voluntarily recorded word entries of their native language and made them available under a Creative Commons license. In the present study, we take Polish, a less-resourced language in terms of phonetic data, as an example, and compare our phonetic observations built on the data from Lingua Libre with the phonetic observations found by previous linguistic studies. We observe that the data from Lingua Libre partially matches the phonetic inventory of Polish as described in previous studies, but that the acoustic values are less precise, thus showing both the potential and the limitations of Lingua Libre to be used for phonetic research.
... One of them is building audio-visual speech corpora in which fast camera recordings and video analysis (visemes) support audio recognition (Almajai et al., 2016;Cooke et al., 2006;Dalka et al., 2014). This concerns both English and national databases (Czyzewski et al., 2017b;Kunka et al., 2013;Benezeth et al., 2011;Trojanová et al., 2008;Żelasko et al., 2016;Kłosowski, 2017). For the audio part, the utterances may be spoken at a slow and normal speech pace; they may also contain prosodic features to improve the learning process of the audio-visual speech recognition system. ...
... Parameters 37-56: mel-frequency cepstral coefficients (MFCCs). They were introduced by Mermelstein (1976) as a tool for speech recognition and are among the most widely used acoustic features in speech and audio processing (Kłosowski, 2017;Kupryjanow and Czyzewski, 2013). ...
Article
Automatic classification methods, such as artificial neural networks (ANNs), the k-nearest neighbor (kNN) and self-organizing maps (SOMs), are applied to allophone analysis based on recorded speech. A list of 650 words was created for that purpose, containing positionally and/or contextually conditioned allophones. For each word, a group of 16 native and non-native speakers were audio-video recorded, from which seven native speakers’ and phonology experts’ speech was selected for analyses. For the purpose of the present study, a sub-list of 103 words containing the English alveolar lateral phoneme /l/ was compiled. The list includes ‘dark’ (velarized) allophonic realizations (which occur before a consonant or at the end of the word before silence) and 52 ‘clear’ allophonic realizations (which occur before a vowel), as well as voicing variants. The recorded signals were segmented into allophones and parametrized using a set of descriptors, originating from the MPEG 7 standard, plus dedicated time-based parameters as well as modified MFCC features proposed by the authors. Classification methods such as ANNs, the kNN and the SOM were employed to automatically detect the two types of allophones. Various sets of features were tested to achieve the best performance of the automatic methods. In the final experiment, a selected set of features was used for automatic evaluation of the pronunciation of dark /l/ by non-native speakers.
... i.e. /ʂ/ occurs more frequently than /s/ does. On the contrary, in Polish, the frequency order of all sibilants is /s/ > /ɕ/ > /ʂ/, i.e. /s/ occurs more often than /ʂ/; see Kłosowski (2017). The relative rarity of /ʂ/ may make it more difficult than /s/ for at least some children. ...
... Prior to that study, the phonology of Polish was described in many sources (e.g., Gussmann, 2007; Jassem, 2003; Oliver, Szklanny, 2006). It should also be noted that considerable effort has been made by several Polish and Lithuanian research centers in the area of speech recognition, a few examples of which are given here: (Kłosowski et al., 2014), analysis of acoustic speech properties (Izydorczyk, Kłosowski, 2001), adaptation of foreign language speech recognition engines for Lithuanian speech recognition (Rudzionis et al., 2009; Kasparaitis, 2008), development of a phonemic language corpus for Polish (Kłosowski, 2017) by employing automatic grapheme-to-phoneme conversion of the source orthographic language corpus, obtained from the National Corpus of Polish (NCP) (Przepiórkowski et al., 2012), creating Polish phoneme statistics (Ziółko et al., 2009; 2014), etc. ...
Article
The goal of this research is to find a set of acoustic parameters that are related to differences between Polish and Lithuanian language consonants. In order to identify these differences, an acoustic analysis is performed, and the phoneme sounds are described as the vectors of acoustic parameters. Parameters known from the speech domain as well as those from the music information retrieval area are employed. These parameters are time- and frequency-domain descriptors. English language as an auxiliary language is used in the experiments. In the first part of the experiments, an analysis of Lithuanian and Polish language samples is carried out, features are extracted, and the most discriminating ones are determined. In the second part of the experiments, automatic classification of Lithuanian/English, Polish/English, and Lithuanian/Polish phonemes is performed.
... As shown by [10], the frequency order of all sibilants in Putonghua is /ʂ/ > /ɕ/ > /s/, i.e. /ʂ/ occurs more frequently than /s/ does. On the contrary, in Polish, the frequency order of all sibilants is /s/ > /ɕ/ > /ʂ/, i.e. /s/ occurs more often than /ʂ/; see [6]. In addition, although quantitative studies are still missing, it has been reported that /ɕ/ is commonly used in motherese in Polish [4]. ...
Conference Paper
This paper reports acoustic characteristics of Polish sibilants acquired in the process of language learning. We tested the production of three phonemic sibilants /s, ʂ, ɕ/ produced by 81 Polish children ages 35 to 106 months. Our results based on an acoustic analysis complemented by a perceptual categorization test by adults reveal that the alveolo-palatal /ɕ/ is the first sound which becomes separated from the other sibilants in terms of F2 of the following vowel and centre of gravity. The next sounds acquired are /s/ and /ʂ/. In the perceptual test most errors were found for the retroflex /ʂ/ confirming its late acquisition in comparison to other sibilants.
Article
Substantial amounts of resources are usually required to robustly develop a language model for an open vocabulary speech recognition system as out-of-vocabulary (OOV) words can hurt recognition accuracy. In this work, we applied a hybrid lexicon of word and sub-word units to resolve the problem of OOV words in a resource-efficient way. As sub-lexical units can be combined to form new words, a compact set of hybrid vocabulary can be used while still maintaining a low OOV rate. For Thai, a syllable-based unit called pseudo-morpheme (PM) was chosen as a sub-word unit. To also benefit from different levels of linguistic information embedded in different input types, a hybrid recurrent neural network language model (RNNLM) framework is proposed. An RNNLM can model not only information from multiple-type input units through a hybrid input vector of words and PMs, but can also capture long context history through recurrent connections. Several hybrid input representations were also explored to optimize both recognition accuracy and computational time. The hybrid LM has shown to be both resource-efficient and well-performed on two Thai LVCSR tasks: broadcast news transcription and speech-to-speech translation. The proposed hybrid lexicon can constitute an open vocabulary for Thai LVCSR as it can greatly reduce the OOV rate to less than 1 % while using only 42 % of the vocabulary size of the word-based lexicon. In terms of recognition performance, the best proposed hybrid RNNLM, which uses a mixed word-PM input, obtained 1.54 % relative WER reduction when compared with a conventional word-based RNNLM. In terms of computational time, the best hybrid RNNLM has the lowest training and decoding time among all RNNLMs including the word-based RNNLM. The overall relative reduction on WER of the proposed hybrid RNNLM over a traditional n-gram model is 6.91 %.
Book
This book is about investigating the way people use language in speech and writing. It introduces the corpus-based approach to linguistics, based on analysis of large databases of real language examples stored on computer. Each chapter focuses on a different area of linguistics, including lexicography, grammar, discourse, register variation, language acquisition, and historical linguistics. Example analyses are presented in each chapter to provide concrete descriptions of the research methods and advantages of corpus-based techniques. Ten methodology boxes provide clear and concise explanations of the issues in doing corpus-based research and reading corpus-based studies and there is a useful appendix of resources for corpus-based investigation. This lucid and comprehensive introduction to the subject will be welcomed by a broad range of readers, from undergraduate students to professional researchers.
Conference Paper
The article presents a rule-based automatic phonemic transcription method for Polish, implemented by the author in the Python programming language. Building the application required formulating and implementing the phonemic transcription rules and implementing the automatic phonemic transcription algorithm. The resulting application performs automatic phonemic transcription of any orthographic text file in Polish. Using large language corpus files as orthographic input makes it possible to create large phonemic language corpora for further speech and language processing research. The phonemic language corpus developed for Polish opens up further opportunities for research on improving automatic speech recognition: it enables statistical analysis of the language and the development of statistical phonemic language models for improving automatic speech recognition.
Conference Paper
Creating a speaker recognition application requires advanced speech processing techniques implemented in specialized speech processing software. Speaker recognition research can benefit from a speech processing platform based on open source software. The article presents an example of using open source speech processing software to perform speaker verification experiments designed to test various speaker recognition models under different scenarios. Speaker verification efficiency was evaluated for each scenario using the TIMIT speech corpus distributed by the Linguistic Data Consortium. The experimental results made it possible to compare the scenarios and select the best one for building a speaker model for a speaker verification application.
Conference Paper
This paper examines the influence of speech corpus size on speaker recognition performance. Results obtained for the TIMIT corpus are compared with results obtained for the smaller ROBOT database. The influence of feature dimensionality and speaker model size was also tested. The best results were obtained with MFCC features. The lowest EER for the larger TIMIT database is 4 times worse than the best result for the ROBOT corpus, which confirms that biometric systems should be tested on data sets as large as possible to ensure that the achieved error rates are statistically significant.
Article
One of the primary steps in building automatic speech recognition (ASR) and text-to-speech systems is the development of a phonemic lexicon that provides a mapping between each word and its pronunciation as a sequence of phonemes. Phoneme lexicons can be developed by humans through use of linguistic knowledge, however, this would be a costly and time-consuming task. To facilitate this process, grapheme-to-phoneme conversion (G2P) techniques are used in which, given an initial phoneme lexicon, the relationship between graphemes and phonemes is learned through data-driven methods. This article presents a novel G2P formalism which learns the grapheme-to-phoneme relationship through acoustic data and potentially relaxes the need for an initial phonemic lexicon in the target language. The formalism involves a training part followed by an inference part. In the training part, the grapheme-to-phoneme relationship is captured in a probabilistic lexical modeling framework. In this framework, a hidden Markov model (HMM) is trained in which each HMM state representing a grapheme is parameterized by a categorical distribution of phonemes. Then in the inference part, given the orthographic transcription of the word and the learned HMM, the most probable sequence of phonemes is inferred. In this article, we show that the recently proposed acoustic G2P approach in the Kullback–Leibler divergence-based HMM (KL-HMM) framework is a particular case of this formalism. We then benchmark the approach against two popular G2P approaches, namely joint multigram approach and decision tree-based approach. Our experimental studies on English and French show that despite relatively poor performance at the pronunciation level, the performance of the proposed approach is not significantly different than the state-of-the-art G2P methods at the ASR level.
Article
This contribution aims at giving an overview of automatic speech recognition research, highlighting the needs for corpora development. As recognition systems largely rely on statistical approaches, large amounts of both spoken and written corpora are required. In order to fill the gap between written and spoken language, speech transcripts need to be produced manually using dedicated tools. Methods and resources accumulated over the years now make it possible not only to tackle genuine oral genres, but also to envision large-scale corpus studies to increase our knowledge of spoken language, as well as to improve automatic processing.
Chapter
This chapter describes the state-of-the-art technology for statistical ASR based on the pattern recognition paradigm. The most widely used core technology is the hidden Markov model (HMM). This is basically a Markov chain that characterizes a speech signal in a mathematically tractable way. Section 2.1 provides an overview of pattern recognition. In Section 2.2, we review the theory of Markov chains and the general form of an HMM, including three practical problems in using HMMs. In Section 2.3, we describe in detail the pattern recognition task for HMM-based ASR systems, starting from feature extraction, which processes the speech signal into a set of feature patterns, up through the search algorithm, which maps those features into the most probable strings of words. We also explain language modeling, the pronunciation dictionary, and acoustic modeling, including phone-unit-dependent models, speech observation density, and various approaches to parameter tying.