Developing Corpus Management System
for Bahasa Indonesia
The "Perisalah" Project
Teduh Ulinansyah, Hammam Riza and Oskar Riandi
Information and Computation Systems, ICT Center (PTIK)
Agency for the Assessment and Application of Technology (BPPT)
Jakarta, Indonesia
Abstract: This paper presents a report on the research and
development of an Indonesian corpus management system as part of
the speech summarization system (Perisalah). The continuous
improvement of speech recognition for the Indonesian language
requires a better and larger monolingual corpus. We discuss our
method of building speech recognition. The system is equipped with
a capability to handle variation in speech input, a more natural mode
of communication between the system and its users. We discuss the
data contained in our text corpus and the corpus management system,
mainly how to handle sentence segmentation and unknown words
(typos).
Keywords: corpus management system, speech processing,
natural language, Bahasa Indonesia
I. INTRODUCTION
Nowadays, corpora have become an essential part of any
project dealing with natural language processing. Our
efforts to create an Indonesian monolingual corpus started
when we were involved in the Multilingual Machine Translation
Project (MMTS), a project to build a machine translation system
that can translate sentences among five Asian languages
(Indonesian, Japanese, Malaysian, Thai, Chinese) and English.
This project was initiated in 1985 and ended in 1991. Since
then, we have been actively involved in various projects in
the field of Natural Language Processing and Speech
Processing, such as Universal Networking Language (UNL),
WebTrans (an online English-to-Indonesian translation
service), Lentera (Indonesian language learning software for
foreigners), IndoMorfo (Indonesian morphological analysis),
U-STAR (Universal Speech Translation Advanced Research),
and Perisalah (an automatic transcription system for the
Indonesian language), one of Indonesia's national priority
programs.
Our previous work focused on developing an Indonesian
speech recognition engine for Perisalah. Achieving high
performance is often the dominant design criterion when
implementing a speech recognition system. Current
state-of-the-art speech recognition technology can produce
speaker-independent recognizers with extremely high recognition
rates for small and medium vocabularies. Although the average
recognition rates are high, some speakers have recognition rates
considerably worse than others. It is generally agreed that a
speaker-dependent system gives the best performance in
applications involving a specific speaker. This requires,
however, that enough training data be available to train the
system from scratch. A common alternative is to train a
speaker-independent system using data from many speakers, but
experiments have shown that such systems generally perform
worse than a speaker-dependent system. This problem can
be overcome, at least partially, by speaker adaptation
techniques, which take an initial model that is already trained
and use a sample of a new speaker's data to improve how well the
current model set represents that speaker. By collecting data
from a speaker and training a model set on this speaker's data
alone, the speaker's characteristics can be modeled more
accurately. Such systems are commonly known as speaker-dependent
systems and, on a typical word recognition task, may have half
the errors of a speaker-independent system. The drawback of
speaker-dependent systems is that a large amount of data
(typically hours) must be collected to obtain sufficient
model accuracy. Rather than training speaker-dependent
models, adaptation techniques can be applied. In this case,
using only a small amount of data from a new speaker, a good
speaker-independent model set can be adapted to better
fit the characteristics of this new speaker.
Speaker adaptation techniques can be used in several
modes. If the true transcription of the adaptation data
is known, it is termed supervised adaptation, whereas if
the adaptation data is unlabelled, it is termed unsupervised
adaptation. When all the adaptation data is
available in one block, e.g. from a speaker enrollment session,
this is termed static adaptation. Alternatively, adaptation can
proceed incrementally as adaptation data becomes available,
and this is termed incremental adaptation.
This paper mainly discusses our progress in improving the
Perisalah system by enlarging our text corpus. We also
introduce the corpus management system and elaborate on how
we handle sentence segmentation and unknown words (typos).
II. INDONESIAN MONOLINGUAL CORPUS
A. Related Work on Indonesian Language Resources
Up to now, the Indonesian monolingual corpus consists of
around 10.5 million unique sentences. This corpus has been
collected from various sources available on the Internet, such as
national newspapers/magazines and governmental institutions
(presidential speeches, meeting transcriptions, trial
transcriptions, etc.), using HTTrack, a free offline browser
utility [1]. Table 1 lists the corpus data obtained
from these sources.
The Indonesian monolingual text corpus is now used as a
meta corpus to create the language model for Perisalah, a speech
recognition and automatic transcription system for the Indonesian
language. Therefore, it is essential that the meta corpus be as
free as possible from errors such as mistyped words (typos).
Sentence segmentation is also important, since correct
segmentation increases the accuracy of the language model used
by Perisalah.
Prior to this work, there have been some attempts at
creating Indonesian corpora of various genres, such as the
Indonesian PAN Localization Project corpora [7].
Although the history of corpora is relatively short,
technological advances have enabled the creation of many different
types of such sets of texts. Nowadays, apart from monolingual
corpora, there are bilingual and multilingual corpora. Another
type is the sample corpus, which shows the state of a
language at a given point in time. A reference corpus is one
that can reliably portray all the features of a language. There
are also historical corpora, which aim to compare past
forms of a language with its present state; they can be
subdivided into two kinds depending on the features they
emphasize. Diachronic corpora present samples of a
language at intervals of about a generation of users, while
monitor corpora attempt to follow language change while
it occurs. Apart from historical approaches, there are topic
corpora, which focus on a particular field of interest or genre.
Genre is the first design criterion for the Indonesian
reference corpus that was built for the PANL10n project [7].
Genres of spoken and written texts are being intensively
studied from various angles, e.g., communication studies,
discourse analysis, and computational linguistics, without arriving
at a generally accepted definition (see [8]).
Building a reference corpus of web genres is certainly
difficult because web documents are often characterized by a
high level of genre hybridism, by a fragmentation of textual
quality across several documents, and by the impact of technical
features such as hyperlinking, posting facilities, and
multi-authoring. Since the web is a huge reservoir of documents
that can be easily mined for building all sorts of corpora, it is
important to overcome the subjectivity that characterizes
genre-related issues in order to create sharable resources.
What should we consider when designing a reference corpus
of web genres? Genres of web documents show some traits
that are not accounted for in TREC collections or in the BNC
but that are important on the web.
B. Corpus Management System
To manage several projects related to the improvement of the
Perisalah system, we have developed a corpus management
system to keep track of corpora of various genres.
The system, shown in Figure 1, is centered around a portal with
storage and retraining tools. Figure 2 shows the process
of language model (LM) retraining, which is needed to
improve the quality of transcription in a user system. Figure
3 shows the complete portal administration page for running
an LM retraining project.
Figure 1. Corpus management portal with Retraining Tools
Figure 2. Language Model (LM) Retraining Tools
Figure 3. CMS showing Project LM Retraining
TABLE 1. LIST OF CORPUS SYSTEM
No. | Topic | Source | Number of articles | Number of sentences | Number of unique sentences | Number of words
1 | Financial | Bank of Indonesia | 124 | 115,431 | 113,615 | 3,081,380
2 | Various topics | DPR (House of Representative) | 355 | 205,405 | 202,816 | 4,293,868
3 | Law | PN (District Court) | 12 | 39,075 | 38,733 | 662,964
4 | Various topics | Presidential speech | 16 | 1,268 | 1,266 | 24,695
5 | Financial | Ministry of Finance | 46 | 6,172 | 6,153 | 135,981
6 | Various topics | Mail archive | 3,685 | 68,455 | 56,267 | 1,092,195
7 | Financial | BPK (Supreme Audit Board) | 501 | 862,542 | 831,334 | 35,521,560
8 | Various topics | DPD (House of Regional Representative) | 755 | 450,270 | 444,836 | 9,902,733
9 | Politics | KPU (National Election Commission) | 1,176 | 23,503 | 16,734 | 399,182
10 | Law | Ministry of Justice and Human Rights | 6,222 | 361,140 | 349,630 | 8,796,144
11 | Literature | Novels | 110,943 | 5,760,141 | 5,684,129 | 72,605,688
12 | Various topics | National newspaper/magazine | 28,795 | 609,728 | 609,275 | 12,484,728
13 | Law | MK (Constitutional Court) | 7,293 | 1,992,251 | 1,912,706 | 36,741,176
14 | Various topics | Combination of all above | 159,923 | 10,495,381 | 10,445,098 | 185,602,460
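The per-source counts reported in Table 1 could be computed along the following lines. This is only a minimal sketch, not the project's actual tooling: the function name corpus_stats is hypothetical, each article is assumed to be already split into sentences, and words are assumed to be separated by whitespace.

```python
def corpus_stats(articles):
    """Count articles, sentences, unique sentences, and words.

    `articles` is a list of articles, each given as a list of sentence
    strings (an assumption of this sketch).
    """
    sentences = [sentence for article in articles for sentence in article]
    return {
        "articles": len(articles),
        "sentences": len(sentences),
        # Exact-string duplicates are collapsed; near-duplicates are not.
        "unique_sentences": len(set(sentences)),
        # Whitespace tokenization is a simplification.
        "words": sum(len(sentence.split()) for sentence in sentences),
    }
```

Under these assumptions, the gap between the sentence count and the unique-sentence count gives a rough indication of how much duplicated text a source contains.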
III. TEXT PROCESSING OF INDONESIAN
This section discusses how we handle two relatively difficult
tasks: recognizing sentence delimiters and handling unknown words
(including typos).
A. Sentence Segmentation
The Indonesian language is characterized by its use of
various affixes (prefixes, infixes, and suffixes) to create
derivatives. Because Indonesia has many tribes, it also has
many local languages. Their exact number is not known,
but it is estimated at around 719 [2]. Although
there exist formal grammatical rules for creating derivatives,
these rules are often 'ignored' by Indonesians (mainly
because of the influence of one's local language). For instance,
Indonesians sometimes write menyocoki while the correct form
is mencocoki.
Formal grammatical rules state that a period, question
mark, or exclamation mark ends a sentence (and may
be followed by a double or single quote), as in "Kalau sudah
dilacak sekarang kan sudah aman," ujarnya or Dari DPRD
mereka baru menuju kantor Golkar. However, this rule does not
apply to the titles of newspaper or magazine articles. Instead,
article titles are frequently written in capital letters, such as
RISALAH RAPAT KERJA, or with a capital first letter in each
word, such as Kantor Golkar Bali Didemo Mahasiswa.
Therefore, in addition to recognizing the end of a sentence
by following formal grammatical rules, a sentence
segmentation parser for the Indonesian language must be aware of
these 'habits' to handle the above phenomena.
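The two kinds of cues described above (terminal punctuation and title-style capitalization) can be sketched as follows. This is a simplified illustration rather than the parser used in Perisalah: the names is_title and segment are hypothetical, titles are assumed to occupy their own line, and abbreviations that contain periods are not handled.

```python
import re

# A sentence ends with . ? or !, optionally followed by a closing quote,
# and the next sentence starts after whitespace.
_BOUNDARY = re.compile(r"""(?:(?<=[.?!])|(?<=[.?!]["']))\s+""")
_TERMINATED = re.compile(r"""[.?!]["']?$""")


def is_title(line):
    """Treat an unterminated all-caps or capitalized-words line as a title."""
    if _TERMINATED.search(line):
        return False
    words = line.split()
    return line.isupper() or all(word[0].isupper() for word in words)


def segment(text):
    """Split text into sentences, keeping each title line as one unit."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if is_title(line):
            sentences.append(line)
        else:
            sentences.extend(s for s in _BOUNDARY.split(line) if s)
    return sentences
```

Note that a comma followed by a closing quote, as in the ujarnya example above, is not treated as a boundary, so quoted speech with its attribution stays in one sentence.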
B. Handling Unknown Words and Typos
In our experience, almost half of the new words
(i.e., words that are not listed in the existing word dictionary)
extracted from new corpora are typos. Our dictionary now has
around two hundred thousand words, including pronunciation
data. The most frequent types of unknown words and typos, and how
we handle them, are discussed below.
In the Indonesian language, a hyphen is used to express
plurality or repetitiveness. It is mostly used for nouns,
such as buku-buku, but it can also be used for
adjectives and verbs. A hyphen between two words,
combined with a rich set of affixes, may create a
new word that is not listed in our dictionary, for
instance susul-menyusuli (one after another). It is
almost impossible to imagine and then list all possible
derivatives created by combining Indonesian affixes.
Therefore, we can currently only add 'new words' extracted
from the meta corpus. Although susul-menyusuli is
grammatically correct, it is regarded as an
unknown word because it is not listed in the dictionary.
The procedure for recognizing this type of unknown
word is relatively simple: the word is split at the hyphen
and its parts are compared with the words listed in the
dictionary.
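The splitting procedure for hyphenated words can be sketched in a few lines. The function name is_known_hyphenated and the tiny dictionary in the usage note are hypothetical; the sketch assumes that a hyphenated word is accepted only when every part is itself a dictionary entry.

```python
def is_known_hyphenated(word, dictionary):
    """Return True if word contains a hyphen and every part is in dictionary."""
    parts = word.split("-")
    if len(parts) < 2:
        return False  # not a hyphenated word
    return all(part in dictionary for part in parts)
```

For example, with a dictionary containing buku, susul, and menyusuli, both buku-buku and susul-menyusuli would be accepted, while buku-bku would remain an unknown word and be passed on to the typo handling.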
Simple typos: For typos that are relatively simple to
resolve, we use a similarity algorithm module [3] to
compare an unknown word with the words listed in the
dictionary. We apply a gradually decreasing threshold,
from 0.95 down to 0.85. That is, if the similarity
algorithm with a threshold of 0.95 cannot match an
unknown word, the threshold is reduced in steps of
0.05 down to 0.85. If the unknown word still cannot be
recognized at a threshold of 0.85,
an edit distance algorithm [4] is used. It is a Perl
module that measures the degree of proximity between two
strings. The 'distance' is the number of substitutions,
deletions, or insertions. We currently set the distance
value to two.
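The two-stage lookup described above can be illustrated with the following sketch. It is not the Perl implementation referenced in [3] and [4]: Python's difflib.SequenceMatcher is used here only as a stand-in similarity measure, so its scores will differ from those of the original module, and the distance value of two is interpreted as a maximum allowed edit distance. The function names are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLDS = (0.95, 0.90, 0.85)  # gradual thresholds from the text
MAX_DISTANCE = 2                 # assumed to be an upper limit


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for the module cited as [3]."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein(a, b):
    """Number of substitutions, deletions, or insertions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_simple_typo(word, dictionary):
    """Return the best dictionary candidate for word, or None."""
    if not dictionary:
        return None
    scored = [(similarity(word, entry), entry) for entry in dictionary]
    for threshold in THRESHOLDS:
        hits = [pair for pair in scored if pair[0] >= threshold]
        if hits:
            return max(hits)[1]
    # Fallback: the closest entry within the edit-distance limit.
    distance, entry = min((levenshtein(word, e), e) for e in dictionary)
    return entry if distance <= MAX_DISTANCE else None
```

A linear scan over a dictionary of around two hundred thousand words is slow; a production system would presumably restrict the candidates first, for example by word length or first letter, but that is outside what the text describes.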
Concatenated words without typos: Sometimes
unknown words result from the concatenation of two
or more words, such as segalakekuatan (correct form:
segala kekuatan). For this kind of typo, we first
create a list of pairs of consecutive words
extracted from the meta corpus. Our script then scans the
unknown word from the beginning, appending one character at a
time, and compares the fragment with the words in the dictionary.
If a fragment is found in the dictionary, the rest of
the unknown word is handled in the same way until all
fragments in the unknown word are recognized. Each
combination of two fragments resulting from this
method is compared with the list of consecutive word
pairs. If the combination exists, our script moves
on to recognize the remaining part of the unknown
word.
If the above method fails, our script tries scanning and
appending per character starting from the end of the
unknown word.
This method is effective as long as there are no typos in
the unknown word, and it can handle long concatenations such
as
pekerjaankantordapatdilayanidengansegalakeperluan
(correct form: pekerjaan kantor dapat dilayani
dengan segala keperluan).
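The forward scan and its backward fallback can be sketched as below. This is one possible reading of the description, not the authors' script: fragments are matched greedily (shortest match first, no backtracking), a fragment is accepted only if it forms a listed pair with its neighbouring fragment, and the function names, the dictionary, and the pair list are all hypothetical.

```python
def split_forward(word, dictionary, pairs):
    """Scan from the beginning, growing each fragment one character at a time."""
    fragments = []
    start = 0
    while start < len(word):
        match = None
        for end in range(start + 1, len(word) + 1):
            candidate = word[start:end]
            if candidate in dictionary and (
                not fragments or (fragments[-1], candidate) in pairs
            ):
                match = candidate
                break
        if match is None:
            return None  # no recognizable fragment at this position
        fragments.append(match)
        start += len(match)
    return fragments


def split_backward(word, dictionary, pairs):
    """Scan from the end, growing each fragment one character at a time."""
    fragments = []
    end = len(word)
    while end > 0:
        match = None
        for start in range(end - 1, -1, -1):
            candidate = word[start:end]
            if candidate in dictionary and (
                not fragments or (candidate, fragments[0]) in pairs
            ):
                match = candidate
                break
        if match is None:
            return None
        fragments.insert(0, match)
        end -= len(match)
    return fragments


def split_concatenated(word, dictionary, pairs):
    """Try the forward scan first, then fall back to the backward scan."""
    return (split_forward(word, dictionary, pairs)
            or split_backward(word, dictionary, pairs))
```

The backward pass matters when a short dictionary word is a prefix of the intended one: with kan and kantor both in the dictionary, the forward scan of kantorbaru stops at kan and then fails on torbaru, while the backward scan finds baru and then kantor.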
Concatenated words with typos: To handle this kind
of typo (such as suelamatpuagiuinodonesia; correct
form: selamat pagi indonesia), we first create a list
of possible sequences of consecutive words (more than two
words) from the meta corpus. This requires a huge
amount of memory, and we are currently only capable of
creating a list of three consecutive words. Our script then
tries to resolve this type of typo by using the similarity
algorithm to compare the unknown word with the
list of three consecutive words.
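A minimal sketch of this comparison is given below. It assumes that each three-word sequence is joined without spaces before being compared with the unknown word, uses Python's difflib.SequenceMatcher as a stand-in for the similarity module [3], and uses an acceptance threshold of 0.8 that is purely illustrative (the text does not state the value used for this step). The function name is hypothetical.

```python
from difflib import SequenceMatcher


def resolve_with_trigrams(unknown, trigrams, threshold=0.8):
    """Return the three-word sequence most similar to unknown, or None.

    `trigrams` is an iterable of 3-tuples of consecutive words taken
    from the meta corpus.
    """
    best_score, best = 0.0, None
    for trigram in trigrams:
        joined = "".join(trigram)  # compare without spaces
        score = SequenceMatcher(None, unknown, joined).ratio()
        if score > best_score:
            best_score, best = score, trigram
    return best if best_score >= threshold else None
```

Because every unknown word is compared with every stored sequence, both memory use and running time grow with the size of the list, which is consistent with the limitation to three consecutive words mentioned above.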
IV. CONCLUSION
This paper has discussed the Indonesian monolingual meta corpus
and its management system, focusing on the handling of sentence
segmentation and typos. In the future, we plan to
categorize the meta corpus by topic so
that there will be corpora for various topics such as politics,
economics, sports, and so on. For handling typos that result from
concatenated words, we plan to use a language model
so that the calculation procedures may be simplified by
looking at the probability data of two consecutive words.
V. REFERENCES
[1] http://www.httrack.com/
[2] http://www.ethnologue.com/country/ID/default/***EDITION***
[3] http://search.cpan.org/~mlehmann/String-Similarity-1.04/Similarity.pm
[4] http://search.cpan.org/~jgoldberg/Text-LevenshteinXS-0.03/LevenshteinXS.pm
[5] Hammam Riza and Oskar Riandi, Toward Asian Speech Translation
System: Developing Speech Recognition and Machine Translation for
Indonesian Language, Proceedings of the Workshop on Technologies
and Corpora for Asia-Pacific Speech Translation (TCAST), January
2008, Hyderabad, India, pp. 30-35,
http://www.slc.atr.jp/TCAST/TCAST2008/TCAST_Home.html
[6] Sakriani Sakti, Eka Kelana, Hammam Riza, Shinsuke Sakai, Konstantin
Markov and Satoshi Nakamura, Development of Indonesian Large
Vocabulary Continuous Speech Recognition System within A-STAR
Project, Proceedings of the Workshop on Technologies and Corpora for
Asia-Pacific Speech Translation (TCAST), January 2008, Hyderabad,
India, pp. 19-24.
[7] Hammam Riza et al., Initial Research Report on Corpus Design,
Collection and Cleaning Tools, PAN Localization Project Report,
Indonesia Country Component, 2009.
[8] Guy Aston and Lou Burnard, The BNC Handbook: Exploring the
British National Corpus with SARA, Edinburgh University Press, 1998.