Stemming Indonesian
Jelita Asian Hugh E. Williams S.M.M. Tahaghoghi
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia.
{jelita,hugh,saied}@cs.rmit.edu.au
Abstract
Stemming words to (usually) remove suffixes has ap-
plications in text search, machine translation, docu-
ment summarisation, and text classification. For ex-
ample, English stemming reduces the words “com-
puter”, “computing”, “computation”, and “com-
putability” to their common morphological root,
“comput-”. In text search, this permits a search for
“computers” to find documents containing all words
with the stem “comput-”. In the Indonesian lan-
guage, stemming is of crucial importance: words
have prefixes, suffixes, infixes, and confixes that make
matching related words difficult. In this paper, we
investigate the performance of five Indonesian stem-
ming algorithms through a user study. Our results
show that, with the availability of a reasonable dictio-
nary, the unpublished algorithm of Nazief and Adri-
ani correctly stems around 93% of word occurrences
to the correct root word. With the improvements we
propose, this almost reaches 95%. We conclude that
stemming for Indonesian should be performed using
our modified Nazief and Adriani approach.
Keywords: stemming, Indonesian Information Retrieval
1 Introduction
Stemming is a core natural language processing
technique for efficient and effective Information Re-
trieval (Frakes 1992), and one that is widely accepted
by users. It is used to transform word variants to their
common root word by applying — in most cases —
morphological rules. For example, in text search, it
should permit a user searching using the query term
“stemming” to find documents that contain the terms
“stemmer” and “stems” because all share the com-
mon root word “stem”. It also has applications in
machine translation (Bakar & Rahman 2003), docu-
ment summarisation (Orăsan, Pekar & Hasler 2004),
and text classification (Gaustad & Bouma 2002).
For the English language, stemming is well-
understood, with techniques such as those of
Lovins (1968) and Porter (1980) in widespread use.
However, stemming for other languages is less well-
known: while there are several approaches available
for languages such as French (Savoy 1993), Span-
ish (Xu & Croft 1998), Malaysian (Ahmad, Yusoff
& Sembok 1996, Idris 2001), and Indonesian (Arifin
& Setiono 2002, Nazief & Adriani 1996, Vega 2001),
there is almost no consensus about their effectiveness.
Indeed, for Indonesian the schemes are neither easily
accessible nor well-explored. There are no compara-
tive studies that consider the relative effectiveness of
alternative stemming approaches for this language.

Copyright © 2005, Australian Computer Society, Inc. This
paper appeared at the 28th Australasian Computer Science
Conference (ACSC2005), The University of Newcastle, Australia.
Conferences in Research and Practice in Information Technology,
Vol. 38. V. Estivill-Castro, Ed. Reproduction for academic,
not-for-profit purposes permitted provided this text is included.
Stemming is essential to support effective Indone-
sian Information Retrieval, and has uses as diverse
as defence intelligence applications, document trans-
lation, and web search. Unlike English, a more com-
plex class of affixes — which includes prefixes, suf-
fixes, infixes (insertions), and confixes (combinations
of prefixes and suffixes) — must be removed to trans-
form a word to its root word, and the application
and order of the rules used to perform this process
requires careful consideration. Consider a simple ex-
ample: the word “minuman” (meaning “a drink”) has
the root “minum” (“to drink”) and the suffix “-an”.
However, many examples do not share the simple suf-
fix approach used by English:
- “pemerintah” (meaning “government”) is derived
  from the root “perintah” (meaning “govern”)
  through the process of inserting the infix “-em-”
  between the “p-” and “-erintah” of “perintah”.
- “anaknya” (a possessive form of child, such as
  “his/her child”) has the root “anak” (“child”)
  and the suffix “-nya” (a third person possessive).
- “duduklah” (please sit) is “duduk” (“sit”) and
  “-lah” (a softening, equivalent to “please”).
- “buku-buku” (books) is the plural of “buku”
  (“book”).
These latter examples illustrate the importance of
stemming in Indonesian: without stemming, for example,
a document containing “anaknya” (“his/her child”)
does not match the term “anak” (“child”).
Several techniques have been proposed for stem-
ming Indonesian. We evaluate these techniques
through a user study, where we compare the per-
formance of the scheme to the results of manual
stemming by four native speakers. Our results show
that an existing technique, proposed by Nazief and
Adriani (1996) in an unpublished technical report,
correctly stems around 93% of all word occurrences
(or 92% of unique words). After classifying the fail-
ure cases, and adding our own rules to address these
limitations, we show this can be improved to 95%
for both unique and all word occurrences. We believe
that adding a more complete dictionary of root words
would improve these results even further. We con-
clude that our modified Nazief and Adriani stemmer
should be used in practice for stemming Indonesian.
2 Stemming Techniques
In this section, we describe the five schemes we have
evaluated for Indonesian stemming. In particular, we
detail the approach of Nazief and Adriani, which per-
forms the best in our evaluation of all approaches in
Section 4. We propose extensions to this approach in
Section 5.
2.1 Nazief and Adriani’s Algorithm
The stemming scheme of Nazief and Adriani is de-
scribed in an unpublished technical report from the
University of Indonesia (1996). In this section, we de-
scribe the steps of the algorithm, and illustrate each
with examples; however, for compactness, we omit
the detail of selected rule tables. We refer to this
approach as nazief.
The algorithm is based on comprehensive morpho-
logical rules that group together and encapsulate al-
lowed and disallowed affixes, including prefixes, suf-
fixes, infixes (insertions) and confixes (combination
of prefixes and suffixes). The algorithm also supports
recoding, an approach to restore an initial letter that
was removed from a root word prior to prepending
a prefix. In addition, the algorithm makes use of an
auxiliary dictionary of root words that is used in most
steps to check if the stemming has arrived at a root
word.
Before considering how the scheme works, we con-
sider the basic groupings of affixes used as a basis
for the approach, and how these definitions are com-
bined to form a framework to implement the rules.
The scheme groups affixes into categories:
1. Inflection suffixes — the set of suffixes that
do not alter the root word. For example,
“duduk” (sit) may be suffixed with “-lah” to give
“duduklah” (please sit). The inflections are fur-
ther divided into:
(a) Particles (P) including “-lah” and
“-kah”, as used in words such as “duduklah”
(please sit)
(b) Possessive pronouns (PP) including
“-ku”, “-mu”, and “-nya”, as used in
“ibunya” (a third person possessive form of
“mother”)
Particle and possessive pronoun inflections can
appear together and, if they do, possessive pro-
nouns appear before particles. A word can have
at most one particle and one possessive pronoun,
and these may be applied directly to root words
or to words that have a derivation suffix. For
example, “makan” (to eat) may be appended
with derivation suffix “-an” to give “makanan”
(food). This can be suffixed with “-nya” to give
“makanannya” (a possessive form of “food”)
2. Derivation suffixes — the set of suffixes that are
directly applied to root words. There can be only
one derivation suffix per word. For example, the
word “lapor” (to report) can be suffixed by the
derivation suffix “-kan” to become “laporkan”
(go to report). In turn, this can be suffixed with,
for example, an inflection suffix “-lah” to become
“laporkanlah” (please go to report)
3. Derivation prefixes — the set of prefixes that are
applied either directly to root words, or to words
that have up to two other derivation prefixes.
For example, the derivation prefixes “mem-” and
“per-” may be prepended to “indahkannya” to
give “memperindahkannya” (the act of beautify-
ing)
The classification of affixes as inflections and
derivations leads to an order of use:
[DP+[DP+[DP+]]] root-word [[+DS][+PP][+P]]
The square brackets indicate that an affix is optional.

Prefix   Disallowed suffixes
be-      -i
di-      -an
ke-      -i, -kan
me-      -an
se-      -i, -kan
te-      -an

Table 1: Disallowed prefix and suffix combinations.
The only exception is that the root word “tahu” is
permitted with the prefix “ke-” and the suffix “-i”.
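To make the groupings concrete, the affix categories and the order-of-use template can be written down directly. The following Python constants are a sketch under our own naming (PARTICLES and the other identifiers are ours, not from the original implementation):

    # Affix categories of the Nazief and Adriani scheme (our naming).
    PARTICLES = {"lah", "kah"}                  # inflection suffixes (P)
    POSSESSIVE_PRONOUNS = {"ku", "mu", "nya"}   # inflection suffixes (PP)
    DERIVATION_SUFFIXES = {"i", "an", "kan"}    # at most one per word (DS)
    DERIVATION_PREFIXES = {"di", "ke", "se", "te", "be", "me", "pe"}  # up to three (DP)

    # Order of use: [DP+[DP+[DP+]]] root-word [[+DS][+PP][+P]]
    # e.g. "memperindahkannya" = mem + per + indah + kan + nya
    #                            (DP)  (DP)  (root)  (DS)  (PP)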
The previous definition forms the basis of the rules
used in the approach. However, there are exceptions
and limitations that are incorporated in the rules:
1. Not all combinations are possible. For example,
after a word is prefixed with “di-”, the word is
not allowed to be suffixed with “-an”. A com-
plete list is shown in Table 1.
2. The same affix cannot be repeatedly applied. For
example, after a word is prefixed with “te-” or
one of its variations, it is not possible to repeat
the prefix “te-” or any of those variations.
3. If a word has one or two characters, then stem-
ming is not attempted.
4. Adding a prefix may change the root word or
a previously-applied prefix; we discuss this fur-
ther in our description of the rules. To illus-
trate, consider “meng-”, which has the variations
“mem-”, “meng-”, “meny-”, and “men-”. Some of
these may change the beginning of the root word;
for example, for the root word “sapu” (broom),
the variation applied is “meny-”, producing the
word “menyapu” (to sweep), in which the “s” is
removed.
The latter complication requires that an effective In-
donesian stemming algorithm be able to add deleted
letters through the recoding process.
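As an illustration of recoding, the following sketch, which is our construction and not the authors' code, restores the “s” elided by the “meny-” variant; in_dictionary stands for the root-word dictionary lookup:

    def strip_meny(word, in_dictionary):
        # Illustrative only: the "meny-" variant of "me-" elides an
        # initial "s" from the root ("sapu" -> "menyapu").
        if not word.startswith("meny"):
            return None
        remainder = word[4:]          # "menyapu" -> "apu"
        if in_dictionary(remainder):  # some roots genuinely start with a vowel
            return remainder
        recoded = "s" + remainder     # recoding: restore the elided letter
        if in_dictionary(recoded):    # finds "sapu"
            return recoded
        return None

For example, strip_meny("menyapu", {"sapu"}.__contains__) returns "sapu".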
The algorithm itself employs three components:
the affix groupings, the order of use rules (and their
exceptions), and a dictionary. The dictionary is
checked after any stemming rule succeeds: if the resul-
tant word is found in the dictionary, then stemming
has succeeded in finding a root word, the algorithm
returns the dictionary word, and then stops; we omit
this lookup from each step in our rule listing. In ad-
dition, each step checks if the resultant word is less
than two characters in length and, if so, no further
stemming is attempted.
For each word to be stemmed, the following steps
are followed:
1. The unstemmed word is searched for in the dic-
tionary. If it is found in the dictionary, it is as-
sumed the word is a root word, and so the word
is returned and the algorithm stops.
2. Inflection suffixes (“-lah”, “-kah”, “-ku”, “-mu”,
or “-nya”) are removed. If this succeeds and
the suffix is a particle (“-lah” or “-kah”), this
step is again attempted to remove any inflec-
tional possessive pronoun suffixes (“-ku”, “-mu”,
or “-nya”).
3. Derivation suffix (“-i”, “-an”, or “-kan”) removal
is attempted. If this succeeds, Step 4 is attempted.
If Step 4 does not succeed:
(a) If “-an” was removed, and the final letter
of the word is “-k”, then the “-k” is also
removed and Step 4 is re-attempted. If that
fails, Step 3b is performed.
Following characters                                               Prefix type
Set 1                  Set 2                  Set 3       Set 4
“-r-”                  “-r-”                  –           –          none
“-r-”                  vowel                  –           –          ter-luluh
“-r-”                  not (“-r-” or vowel)   “-er-”      vowel      ter
“-r-”                  not (“-r-” or vowel)   “-er-”      not vowel  none
“-r-”                  not (“-r-” or vowel)   not “-er-”  –          ter
not (vowel or “-r-”)   “-er-”                 vowel       –          none
not (vowel or “-r-”)   “-er-”                 not vowel   –          te

Table 2: Determining the prefix type for words prefixed with “te-”. If the prefix “te-” does not match one of
the rules in the table, then “none” is returned. Similar rules are used for “be-”, “me-”, and “pe-”.
(b) The removed suffix (“-i”, “-an”, or “-kan”)
is restored.
4. Derivation prefix removal is attempted. This has
several sub-steps:
(a) If a suffix was removed in Step 3, then disal-
lowed prefix-suffix combinations are checked
using the list in Table 1. If a match is found,
then the algorithm returns.
(b) If the current prefix matches any previous
prefix, then the algorithm returns.
(c) If three prefixes have previously been re-
moved, the algorithm returns.
(d) The prefix type is determined by one of the
following steps:
i. If the prefix of the word is “di-”, “ke-”,
or “se-”, then the prefix type is “di”,
“ke”, or “se” respectively.
ii. If the prefix is “te-”, “be-”, “me-”, or
“pe-”, then an additional process of ex-
tracting character sets to determine the
prefix type is required. As an example,
the rules for the prefix “te-” are shown
in Table 2. Suppose that the word be-
ing stemmed is “terlambat” (late). Af-
ter removing “te-” to give “-rlambat”,
the first set of characters are extracted
from the prefix according to the “Set 1”
rules. In this case, the letter following
the prefix “te-” is “r”, and this matches
the first five rows of the table. Follow-
ing “-r-” is “-l-” (Set 2), and so the
third to fifth rows match. Following
“-l-” is “-ambat”, eliminating the third
and fourth rows for Set 3 and deter-
mining that the prefix type is “ter-” as
shown in the rightmost column.
iii. If the first two characters do not match
“di-”, “ke-”, “se-”, “te-”, “be-”, “me-”,
or “pe-” then the algorithm returns.
(e) If the prefix type is “none”, then the al-
gorithm returns. If the prefix type is not
“none”, then the prefix type is looked up in
Table 3, the prefix to be removed is found,
and the prefix is removed from the word; for
compactness, Table 3 shows only the simple
cases and those matching Table 2.
(f) If the root word has not been found, Step 4
is recursively attempted for further prefix
removal. If a root word is found, the algo-
rithm returns.
(g) Recoding is performed. This step de-
pends on the prefix type, and can result
in different prefixes being prepended to the
stemmed word and checked in the dictio-
nary. For compactness, we consider only
the case of the prefix type “ter-luluh” shown
Prefix type   Prefix to be removed
di            di-
ke            ke-
se            se-
te            te-
ter           ter-
ter-luluh     ter-

Table 3: Determining the prefix from the prefix type.
Only simple entries, and those for the te- prefix type,
are shown.
in Tables 2 and 3. In this case, after re-
moving “ter-”, an “r-” is prepended to the
word. If this new word is not in the dictio-
nary, Step 4 is repeated for the new word.
If a root word is not found, then “r-” is re-
moved and “ter-” restored, the prefix is set
to “none”, and the algorithm returns.
5. Having completed all steps unsuccessfully, the al-
gorithm returns the original word.
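The control flow of these five steps can be summarised in code. The sketch below is our reading of the description above, not the authors' implementation: the prefix-type tables, the disallowed prefix-suffix checks, the “-k” rule of Step 3a, and recoding are all omitted, and Step 4 is reduced to a simple prefix-stripping stub.

    def nazief_stem(word, dictionary):
        # Step 1: a dictionary word is assumed to already be a root word;
        # very short words are not stemmed.
        if word in dictionary or len(word) <= 2:
            return word
        original = word

        # Step 2: inflection suffixes. Particles are outermost, so removing
        # one may expose a possessive pronoun.
        for particle in ("lah", "kah"):
            if word.endswith(particle):
                word = word[:-len(particle)]
                break
        for pronoun in ("nya", "ku", "mu"):
            if word.endswith(pronoun):
                word = word[:-len(pronoun)]
                break
        if word in dictionary:
            return word

        # Step 3: derivation suffixes ("-i", "-an", or "-kan").
        for ds in ("kan", "an", "i"):
            if word.endswith(ds):
                candidate = word[:-len(ds)]
                if candidate in dictionary:
                    return candidate
                stemmed = strip_prefixes(candidate, dictionary)  # Step 4
                if stemmed is not None:
                    return stemmed
                break  # Step 3b: the suffix is restored by falling through

        # Step 4 on the suffix-restored word, then Step 5: give up and
        # return the original word.
        stemmed = strip_prefixes(word, dictionary)
        return stemmed if stemmed is not None else original

    def strip_prefixes(word, dictionary):
        # Stub for Step 4: remove up to three derivation prefixes, checking
        # the dictionary after each removal. The real step also consults
        # the prefix-type tables (Tables 2 and 3), rejects disallowed
        # prefix-suffix combinations and repeated prefixes, and recodes.
        for _ in range(3):
            for prefix in ("di", "ke", "se", "te", "be", "me", "pe"):
                if word.startswith(prefix):
                    word = word[len(prefix):]
                    break
            else:
                return None
            if word in dictionary and len(word) > 2:
                return word
        return None

With dictionary = {"selamat"}, nazief_stem("diselamatkan", dictionary) removes “-kan”, then the prefix “di-”, and returns “selamat”.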
2.2 Arifin and Setiono’s Algorithm
Arifin and Setiono (2002) propose a less complex
scheme than that of Nazief and Adriani, but one that
follows a similar approach of using a dictionary, pro-
gressively removing affixes, and dealing with recod-
ing. We refer to this approach as arifin. In sum-
mary, their approach attempts to remove all prefixes
and then all suffixes, stopping when the stemmed
word is found in a dictionary or the number of af-
fixes removed reaches a maximum of two prefixes and
three suffixes. If the stemmed word cannot be found
after prefix and suffix removal is completed, affixes
are restored to the word in all possible combinations
so that possible stemming errors are minimised.
The particular advantage of this approach is that,
if the word cannot be found after the removal of
all affixes, the algorithm then tries all combinations
of removed affixes. This process helps avoid over-
stemming when some parts of the words can be
mistaken as prefixes or suffixes. For example, the
word “diselamatkan” (to be saved) has the root word
“selamat” (safe). After the first step of removing the
prefix “di-”, the remaining word is “selamatkan” (to
save). Since the word is not found in the dictionary,
the algorithm removes “se-” (another valid prefix) to
give “lamatkan”; however, in this case, this is a mis-
take because “se” is not a prefix, but rather part of
the root word. With prefix removal complete, the
next step removes the suffix “-kan” to give “lamat”,
which is not a valid root word. The scheme then tries
combinations of restoring prefixes and suffixes and,
after restoring the prefix “se-”, creates “selamat” and
the result is a successful dictionary lookup.
The disadvantages of the scheme are two-fold.
First, repeated prefixes and suffixes are allowed; for
example, the word “peranan” (role, part) would be
overstemmed to “per” (spring) instead of the cor-
rect root word “peran” (to play the role of). Sec-
ond, the order of affix removal — prefix and then
suffix — can lead to incorrect stemming; for exam-
ple, the word “memberikan” (“to give away” in active
verb form) has the root word “beri” (to give away) but
the algorithm removes “mem-” and then “ber-” (both
valid prefixes) to give “ikan” (fish). The algorithm of
Nazief and Adriani does not share these problems.
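The two-phase structure of arifin, strip and then restore, can be sketched as follows; the affix lists are illustrative subsets of ours, and the helper is our own construction rather than Arifin and Setiono's code:

    def arifin_stem(word, dictionary,
                    prefixes=("di", "se", "mem", "ber", "ke"),
                    suffixes=("kan", "an", "i")):
        removed_prefixes, removed_suffixes = [], []
        # Phase 1: strip up to two prefixes, then up to three suffixes,
        # stopping as soon as a dictionary word appears.
        for _ in range(2):
            if word in dictionary:
                return word
            for p in prefixes:
                if word.startswith(p):
                    removed_prefixes.append(p)
                    word = word[len(p):]
                    break
        for _ in range(3):
            if word in dictionary:
                return word
            for s in suffixes:
                if word.endswith(s):
                    removed_suffixes.append(s)
                    word = word[:-len(s)]
                    break
        if word in dictionary:
            return word
        # Phase 2: restore the removed affixes, innermost first, in all
        # combinations; e.g. restoring "se-" turns "lamat" into "selamat".
        n_p, n_s = len(removed_prefixes), len(removed_suffixes)
        for i in range(n_p + 1):
            for j in range(n_s + 1):
                candidate = ("".join(removed_prefixes[n_p - i:]) + word
                             + "".join(reversed(removed_suffixes[n_s - j:])))
                if candidate in dictionary:
                    return candidate
        return word

With dictionary = {"selamat"}, arifin_stem("diselamatkan", dictionary) strips “di-”, “se-”, and “-kan” to reach “lamat”, then restores “se-” and returns “selamat”, as in the example above.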
2.3 Vega’s Algorithm
The approach of Vega (2001) is distinctly different
because it does not use a dictionary. As we show
later, this reduces its accuracy compared to other ap-
proaches, mostly because it requires that the rules be
correct and complete, and it prevents ad hoc restora-
tion of combinations of affixes and exploring whether
these match root words. In addition, the approach
does not apply recoding.
In brief, the approach works as follows. For each
word to be stemmed, rules are applied that attempt
to segment the word into smaller components. For ex-
ample, the word “didudukkan” (to be seated) might
be considered against the following rule: (di) +
stem(root) + (i|kan). This checks if the word begins
with “di-” and ends with “-i” or “-kan”.
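Such a rule is essentially a pattern match on the word's surface form. A rough Python equivalent of the rule above (our illustration, not Vega's notation) is:

    import re

    # (di) + stem(root) + (i|kan): the word must begin with "di-" and end
    # with "-i" or "-kan"; whatever lies between is taken to be the stem.
    RULE = re.compile(r"^di(?P<stem>.+?)(?:i|kan)$")

    match = RULE.match("didudukkan")
    if match:
        print(match.group("stem"))  # prints "duduk"; note: no dictionary check

Because no dictionary is consulted, the rule fires whether or not the extracted stem is a real root word.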
There are four variants of this algorithm: stan-
dard, extended, iterative standard, and iterative ex-
tended. Standard deals with standard affix removal of
prefixes such as “ber-”, “di-”, and “ke-”, the suffixes
“-i”, “-an”, and “-nya”, and the infixes “-el-”, “-em-”,
“-er-”. In contrast, extended — unlike all other ap-
proaches described in this paper — deals with non-
standard affixes used in informal spoken Indonesian.
The iterative versions recursively stem words. In our
results, we report results with only the first scheme,
which we refer to as vega1; the other variants are
ineffective, performing between 10% and 25% worse than
vega1.
2.4 Ahmad, Yusoff, and Sembok’s Algorithm
This approach (Ahmad et al. 1996) has two distinct
differences to the others: first, it was developed for
the closely-related Malaysian language, rather than
Indonesian; and, second, it does not progressively ap-
ply rules (we explain this next). We have addressed
the former issue by replacing the Malaysian dictio-
nary used by the authors with the Indonesian dictio-
naries discussed later. In practice, however, we could
have done more to adapt the scheme to Indonesian:
the sets of affixes are different between Indonesian
and Malaysian, some rules are not applied in Indone-
sian, and some rules applicable to Indonesian are not
used in Malaysian. We, therefore, believe that the re-
sults we report represent a baseline performance for
this scheme, but it is unclear how much improvement is
possible with additional work.
The algorithm itself is straightforward. An or-
dered list of all valid prefixes, suffixes, infixes, and
confixes is maintained. Before it begins, the algo-
rithm searches for the word in the dictionary and
simply returns this original word if the lookup suc-
ceeds. If the word is not in the dictionary, then the
next rule in the rule list is applied to the word. If the
rule succeeds — for example, the word begins with
“me-” and ends with “-kan” — then the affixes are
removed and the stemmed word checked against the
dictionary. If it is found, then the stemmed word is
returned; if it is not found, then the stemmed word is
discarded and the next rule is applied to the original
word. If all rules fail to discover a stemmed word,
then the original word is returned. The advantage of
not progressively applying rules is that overstemming
is minimised. In addition, similarly to other success-
ful approaches, the scheme supports recoding.
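A sketch of this non-progressive loop, with an illustrative rule list of our own rather than the authors' full ordering:

    # Each rule is a (prefix, suffix) pair; either part may be empty. The
    # ordering of this list determines the stemmer's behaviour.
    RULES = [("mem", "kan"), ("ber", ""), ("", "an"), ("", "i")]

    def ahmad_stem(word, dictionary, rules=RULES):
        if word in dictionary:
            return word
        for prefix, suffix in rules:
            candidate = word
            if prefix:
                if not candidate.startswith(prefix):
                    continue          # this rule does not apply
                candidate = candidate[len(prefix):]
            if suffix:
                if not candidate.endswith(suffix):
                    continue
                candidate = candidate[:-len(suffix)]
            if candidate in dictionary:
                return candidate      # first rule yielding a root word wins
            # otherwise the result is discarded and the next rule is
            # applied to the ORIGINAL word, which limits overstemming
        return word

For example, ahmad_stem("memberikan", {"beri"}) strips “mem-” and “-kan” in a single rule and returns “beri”, avoiding the progressive mistake described in Section 2.2.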
Because the scheme is not progressive, its accu-
racy depends closely on the order of the rules. For
example, if infix rules are listed before prefix rules,
“berasal” (to come from) — for which the correct
stem is “asal” (origin or source) — is stemmed to
“basal” (basalt or dropsy) by removing the infix “-er-”
before the prefix “ber-”. Ahmad et al. have exper-
imented with several variations of the order of the
rules to explore the false positives generated by each
combination.
We report results with only one rule ordering —
which we refer to as ahmad2a — as we have found
that this is the most effective of the seven orderings,
and there is little difference between the variants; the
other schemes either perform the same — in the case
of ahmad2b — or at most 1% worse, in the case of
ahmad1.
Idris (2001) has further extended the scheme of
Ahmad et al. In the extension, a different recoding
scheme and progressive stemming is applied. Similar
to the basic approach, the algorithm checks the dic-
tionary after each step, stopping and returning the
stemmed word if it is found. The scheme works as
follows: first, after checking for the word in the dic-
tionary, the algorithm tests if the prefix of the word
matches a prefix that may require recoding; second, if
recoding is required, it is performed and the resultant
word searched for in the dictionary, while if recoding
is not required, the prefix is removed as usual and
the word checked in the dictionary; third, having re-
moved a prefix, suffix removal is attempted and, if it
succeeds, the word is checked in the dictionary; and,
last, the algorithm returns to the second step with
the partially stemmed word.
There are two variants of this algorithm: the first
changes prefixes and then performs recoding, while
the second does the reverse. In our results, we report
results with only the second scheme, which we refer
to as idris2; the other variant performs only 0.3%
worse under all measures.
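Our reading of the progressive idris2 loop, as a sketch (the recoding table here is a tiny illustrative sample of ours, not Idris's full set):

    def idris_stem(word, dictionary,
                   recodings=(("meny", "s"), ("peny", "s")),
                   prefixes=("di", "ke", "se", "ter", "ber", "me", "pe"),
                   suffixes=("kan", "an", "i")):
        for _ in range(3):                      # bounded number of passes
            if word in dictionary:
                return word
            # idris2 variant: try recoding before plain prefix removal.
            for pre, letter in recodings:
                if word.startswith(pre):
                    candidate = letter + word[len(pre):]
                    if candidate in dictionary:
                        return candidate
            for pre in prefixes:
                if word.startswith(pre):
                    word = word[len(pre):]
                    break
            if word in dictionary:
                return word
            for suf in suffixes:
                if word.endswith(suf):
                    word = word[:-len(suf)]
                    break
        return word

For instance, idris_stem("menyapu", {"sapu"}) returns "sapu" via the recoding branch before any plain prefix is stripped.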
3 Experiments
To investigate the performance of stemming schemes,
we have carried out a large user experiment. In this,
we compared the results of stemming with each of the
algorithms to manual stemming by native Indonesian
speakers. This section explains the collection we used,
the experimental design, and presents our results.
3.1 Collection
The collection of words to be stemmed was taken
from news stories at the Kompas online newspaper
(see: http://www.kompas.com/).
In all, we obtained 9,901 documents. From these, we
extracted every fifth word occurrence and kept those
words that were longer than five characters in length;
our rationale for the minimum length is that shorter
words rarely require stemming, that is, they are al-
ready root words. Using this method, our word col-
lection contained 3,986 non-unique words and 1,807
unique words.
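The sampling step is easy to reproduce; this sketch assumes whitespace tokenisation, which the paper does not specify:

    def sample_words(documents, step=5, min_length=6):
        # Keep every fifth word occurrence that is longer than five
        # characters; shorter words are usually already root words.
        words = [w for doc in documents for w in doc.split()]
        return [w for w in words[step - 1::step] if len(w) >= min_length]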
We chose to extract non-unique words to reflect
the real-world stemming problem encountered in text
search, document summarisation, and translation.
The frequency of word occurrences is highly skewed:
in English, for example, “the” appears about twice
as often as the next most common word; a similar
phenomenon exists in Indonesian, where “yang” (a relative
pronoun that is similar to “who”, “which”, or
“that”, or “the” if used with an adjective) is the most
common word. Given this skew, it is crucial that common
words are correctly stemmed, but less so that rare
words are.

       B      C      D
A   3,674  3,689  3,564
B      –   3,588  3,555
C      –      –   3,528

Table 4: Results of manual stemming by four Indone-
sian native speakers, denoted as A to D. The values
shown are the number of cases out of 3,986 where
participants agree.
We use the collection in two ways. First, we inves-
tigate the error rate of stemming algorithms relative
to manual stemming for the non-unique word collec-
tion. This permits quantifying the overall error rate
of a stemmer for a collection of real-world documents,
that is, it allows us to discover the total errors made.
Second, we investigate the error rate of stemming for
unique words only. This allows us to investigate how
many different errors each scheme makes, that is, the
total number of unique errors. Together, these allow
effective assessment of stemming accuracy.
3.2 Baselines
We asked four native Indonesian speakers to manually
stem each of the 3,986 words. The words were listed
in their order of occurrence, that is, repeated words
were distributed across the collection and words were
not grouped by prefix. Table 4 shows the results:
as expected, there is no consensus as to the root
words between the speakers and, indeed, the agree-
ment ranges from around 93% (for speakers A and
C) to less than 89% (for C and D). For example, the
word “bagian” (part) is left unstemmed in some cases
and stemmed to “bagi” (divide) in others, and simi-
larly “adalah” (to be) is sometimes stemmed to “ada”
(exists) and sometimes left unchanged. Indeed, the
latter example illustrates another problem: in some
cases, a speaker was inconsistent, on some occasions
stemming “adalah” to “ada”, and on others leaving
it unchanged.
Having established that native speakers disagree
and also make errors, we decided to use the major-
ity decision as the correct answer. Table 5 shows the
number of cases where three and four speakers agree.
All four speakers agree on only 82.6% of word oc-
currences while, at the other extreme, speakers B, C,
and D agree on 84.3%. The number of cases where
any three or all four speakers agree (shown as “Any
three”) is 95.3%. We use this latter case as our first
baseline to compare to automatic stemming: if a ma-
jority agree then we keep the original word in our
collection and note its answer as the majority deci-
sion; words that do not have a majority stemming
decision are omitted from the collection. We refer to
this baseline collection of 3,799 words as majority.
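The majority rule can be written directly; in this sketch (ours), a word occurrence is kept only when at least three of the four manual stems agree:

    from collections import Counter

    def majority_stem(manual_stems):
        # manual_stems: the four native speakers' answers for one word
        # occurrence; None means the occurrence is omitted from majority.
        stem, count = Counter(manual_stems).most_common(1)[0]
        return stem if count >= 3 else None

    assert majority_stem(["ada", "ada", "ada", "adalah"]) == "ada"
    assert majority_stem(["spiritual", "spiritual", "spirit", "ritual"]) is None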
Interestingly, even the majority make errors. How-
ever, this is rare and so we accept the majority deci-
sion. For example, for “penebangan” (felling; cutting
down; chopping down), the correct stem is “tebang”
(to cut down; to chop down). The majority misread
the word as the more common “penerbangan” (flight;
flying), and stemmed it to “terbang” (to fly).
The correct stems are sometimes ambiguous; for
example, the suffix “-kan” can be removed from
“gerakan” (movement) to give “gera” (to frighten
or threaten), or the suffix “-an” can be removed to
give “gerak” (to move); both are correct. We found
that all four human subjects stemmed this word to
“gerak”.

ABCD   ABC    ABD    ACD    BCD    Any three
3,292  3,493  3,413  3,408  3,361  3,799

Table 5: Consensus and majority agreement for man-
ual stemming by four Indonesian native speakers, de-
noted as A to D. The values shown are the number of
cases out of 3,986 where participants agree.
There are 1,751 unique unstemmed words used to
create the majority collection. However, after stem-
ming, the unique stemmed word collection that is
agreed by the majority has 1,753 words. This increase
of 2 words is due to cases such as “adalah” remaining
unstemmed by 3 out of 4 speakers in some cases and
being stemmed by 3 out of 4 to “ada” in other cases.
We refer to the collection of unique stemmed words
that are agreed by the majority as unique, and we
use this as our second baseline for automatic stem-
ming.
We have also investigated the performance of
automatic stemming schemes when the complete
collection is used and they stem to any of the manual
stemming results. For example, consider a case where
the word “spiritual” (spiritual) is stemmed by two
speakers to “spiritual”, by a third to “spirit” (spirit),
and the fourth to “ritual” (ritual). In this case, if an
automatic approach stems to any of the three manual
stems, we deem it to have correctly stemmed the word.
Not surprisingly, all automatic schemes perform
better under this measure. However, the results do
not show any change in relative performance between
the schemes and we omit the results from this paper
for compactness.
4 Results
Table 6 shows the results of our experiments using
the majority and unique collections. The nazief
scheme works best: it correctly stems 93% of word oc-
currences and 92% of unique words, making less than
two-thirds of the errors of the second-ranked scheme,
ahmad2a. The remaining dictionary schemes —
ahmad2a,idris2, and arifin — are comparable and
achieve 88%–89% on both collections. The only non-
dictionary scheme, vega1, makes almost five times as
many errors on the majority collection as nazief,
illustrating the importance of validating decisions us-
ing an external word source.
Interestingly, the idris2 approach offers no im-
provement to the ahmad scheme on which it is based.
On the unique collection, idris2 is 0.5% or eight
words better than ahmad2a. However, on the ma-
jority collection, idris2 is 0.9% or 34 words worse
than ahmad2a. This illustrates an important char-
acteristic of our experiments: stemming algorithms
should be considered in the context of word occur-
rences and not unique words. While idris2 makes
fewer errors on rare words, it makes more errors on
more common words, and is less effective overall for
stemming document collections.
The performance of the nazief scheme is impres-
sive and, for this reason, we focus on it in the remain-
der of this paper. Under the strict majority model
— where only one answer is allowed — the scheme
incorrectly stems fewer than 1 in 13 words longer
than 5 characters; in practice, when short words are
included, this is an error rate of less than 1 in 21
word occurrences.

Stemmer    majority            unique
           Correct   Errors    Correct   Errors
           (%)       (words)   (%)       (words)
nazief     92.8      272       92.1      139
ahmad2a    88.8      424       88.3      205
idris2     87.9      458       88.8      197
arifin     87.7      466       88.0      211
vega1      66.3      1,280     69.4      536

Table 6: Automatic stemming performance compared
to the majority and unique baseline collections.

However, there is still scope for
improvement: even under a model where all 3,986
word occurrences are included and any answer pro-
vided by a native speaker is deemed correct, the al-
gorithm achieves only 95%. Considering both cases,
therefore, there is scope for at least a 5% improve-
ment in performance by eliminating failure cases and
seeking to make better decisions in non-majority de-
cision cases. We consider and propose improvements
in the next section.
5 Improving the Nazief and Adriani Stemmer
In this section, we discuss the reasons why the nazief
scheme works well, and what aspects of it can be im-
proved. We present a detailed analysis of the failure
cases, and propose solutions to these problems. We
then present the results of including these improve-
ments, and describe our modified nazief approach.
5.1 Discussion
The performance of the nazief approach is perhaps
unsurprising: it is by far the most complex approach,
being based closely on the detailed morphological
rules of the Indonesian language. In addition, it sup-
ports dictionary lookups and progressive stemming,
allowing it to evaluate each step to test if a root word
has been found and to recover from errors by restor-
ing affixes to attempt different combinations. How-
ever, despite these features, the algorithm can still be
improved.
In Table 7, we have classified the failures made
by the nazief scheme on the majority collection.
The two most significant faults are dictionary related:
around 33% of errors are the result of non-root words
being in the dictionary, and around 11% are the result
of root words not being in the dictionary. Hyphen-
ated words — which we discuss in more detail in the
next section — contribute 15.8% of the errors. Of
the remaining errors, around 49 errors or 18% are re-
lated to rules and rule precedence. The remaining
errors are foreign words, misspellings, acronyms, and
proper nouns.
In summary, three opportunities exist to improve
stemming with nazief. First, a more complete and
accurate root word dictionary may reduce errors. Sec-
ond, features can be added to support stemming
of hyphenated words. Last, new rules and adjust-
ments to rule precedence may reduce over- and under-
stemming, as well as support affixes not currently
catered for in the algorithm. We discuss the improve-
ments we propose in the next section.
5.2 Improvements
To address the limitations of the nazief scheme, we
propose the following improvements:
1. Using a more complete dictionary — we have
experimented with two other dictionaries, and
present our results later.
2. Adding rules to deal with plurals — when plu-
rals, such as “buku-buku” (books) are encoun-
tered, we propose stemming these to “buku”
(book). However, care must be taken with other
hyphenated words such as “bolak-balik” (to and
fro), “berbalas-balasan” (mutual action or inter-
action) and “seolah-olah” (as though). For these
latter examples, we propose stemming the words
preceding and following the hyphen separately
and then, if the words have the same root word,
to return the singular form. For example, in the
case of “berbalas-balasan”, both “berbalas” and
“balasan” stem to “balas” (response or answer),
and this is returned. In contrast, the words
“bolak” and “balik” do not have the same stem,
and so “bolak-balik” is returned as the stem; in
this case, this is the correct action, and this works
for many hyphenated non-plurals (see the sketch
after this list).
3. Adding prefixes and suffixes, and additional
rules:
(a) Adding the particle (inflection suffix)
“-pun”. This is used in words such as
“siapapun” (where the root word is “siapa”
[who]). The suffix “-pun” is mentioned in
the technical report but is not included by
Nazief and Adriani in their implementation.
(b) For the prefix type “ter”, we have modi-
fied the conditions so that row 4 in Table 2
sets the type to “ter” instead of “none”.
This supports cases such as “terpercaya”
(the most trusted), which has the root word
“percaya” (believe).
(c) For the prefix type “pe”, we have modi-
fied the conditions (similar to those listed
in Table 2) so that words such as “pekerja”
(worker) and “peserta” (member) have pre-
fix type “pe”, instead of the erroneous
“none”.
(d) For the prefix type “mem”, we have modi-
fied the conditions so that words beginning
with the prefix “memp-” are of type “mem”.
(e) For the prefix type “meng”, we have mod-
ified the conditions so that words begin-
ning with the prefix “mengk-” are of type
“meng”.
4. Adjusting rule precedence:
(a) If a word is prefixed with “ber-” and suf-
fixed with the inflection suffix “-lah”, try
to remove prefix before the suffix. This
addresses problems with words such as
“bermasalah” ([having a problem] where
the root word is “masalah” [problem]) and
“bersekolah” ([be at school] where the root
word is “sekolah”[school]).
(b) If a word is prefixed with “ber-” and suffixed
with the derivation suffix “-an”, try to re-
move prefix before the suffix. This solves
problems with, for example, “berbadan”
([having the body of] the root word is
“badan” [body]).
(c) If a word is prefixed with “men-” and suf-
fixed with the derivation suffix “-i”, try to
remove prefix before the suffix. This solves
problems with, for example, “menilai” ([to
mark] the root word is “nilai” [mark]).
                                             Examples
Fault Class                                  Original        Error           Correct     Total Cases
Non-root words in dictionary                 sebagai         sebagai         bagai               91
Hyphenated words                             buku-buku       buku-buku       buku                43
Incomplete dictionary                        bagian          bagi            bagian              31
Misspellings                                 penambahanan    penambahanan    tambah              21
Incomplete affix rules                       siapapun        siapapun        siapa               20
Overstemming                                 berbadan        bad             badan               19
Peoples’ names                               Abdullah        Abdul           Abdullah            13
Names                                        minimi          minim           minimi               9
Combined words                               pemberitahuan   pemberitahuan   beritahu             7
Recoding ambiguity (dictionary related)      berupa          upa             rupa                 7
Acronyms                                     pemilu          milu            pemilu               4
Recoding ambiguity (rule related)            peperangan      perang          erang                2
Other                                        sekali          sekali          kali                 2
Understemming                                mengecek        ecek            cek                  1
Foreign words                                mengakomodir    mengakomodir    akomodir             1
Human error                                  penebangan      terbang         tebang               1
Total                                                                                           272

Table 7: Classified failure cases of the nazief stemmer on the majority collection. The total shows the total
occurrences, not the number of unique cases.
Stemmer                              majority              unique
                                     Correct    Errors     Correct    Errors
                                     (%)        (words)    (%)        (words)
Original                             92.8       272        92.1       139
(A) Alternative KBBI dictionary      88.8       426        86.9       229
(B) Alternative Online dictionary    93.8       236        92.5       131
(C) Adding repeated word rule        93.9       232        94.0       105
(D) Changes to rule precedence       93.3       255        92.7       128
(E) Adding additional affixes        93.3       253        92.8       127
(F) Combining (C) + (D) + (E)        94.8       196        95.3        82

Table 8: Improvements to the nazief stemmer, measured with the majority and unique baseline collections.
(d) If a word is prefixed with “di-” and suffixed
with the derivation suffix “-i”, try to remove
prefix before the suffix. This solves prob-
lems with, for example, “dimulai” ([to be
started] the root word is “mulai” [start]).
(e) If a word is prefixed with “pe-” and suffixed
with the derivation suffix “-i”, try to remove
prefix before the suffix. This solves prob-
lems with, for example, “petani” ( [farmer]
the root word is “tani” [farm]).
(f) If a word is prefixed with “ter-” and suffixed
with the derivation suffix “-i”, try to remove
prefix before the suffix. This solves prob-
lems with, for example, “terkendali” ([can
be controlled] the root word is “kendali”
[control]).
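A sketch of the repeated-word rule from item 2 above, where stem stands for any underlying stemmer (for example, the nazief sketch in Section 2.1):

    def stem_hyphenated(word, stem):
        # Stem the halves of a hyphenated word independently; if they share
        # a root, the word is a plural or reduplicated form and the
        # singular root is returned; otherwise the word is left intact.
        if "-" not in word:
            return stem(word)
        left, right = word.split("-", 1)
        left_root, right_root = stem(left), stem(right)
        if left_root == right_root:
            return left_root  # "buku-buku" -> "buku"; "berbalas-balasan" -> "balas"
        return word           # "bolak-balik" is returned unchanged

The rule-precedence adjustments in item 4 amount to reordering two removal attempts: for the listed prefix-suffix combinations (for example “ber-...-an” or “men-...-i”), the prefix is tried before the suffix, so that “menilai” yields “nilai” rather than an overstemmed form.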
We present results with these improvements in the
next section.
5.3 Results
Table 8 shows the results of our improvements
to the nazief stemmer. Using a different, well-
curated dictionary does not guarantee an improve-
ment: the second and third rows show the result when
the 29,337 word dictionary used in developing the
original nazief approach is replaced with the 27,828
word Kamus Besar Bahasa Indonesia (KBBI) dic-
tionary and with an online dictionary of unknown
size (see: http://nlp.aia.bppt.go.id/kebi/). Despite
the KBBI dictionary being perhaps more
comprehensive than the original, performance
actually drops by 4.0% on the majority collection
and 5.2% on the unique words. We believe this is due
to three factors: first, dictionaries often contain un-
stemmed words and, therefore, can cause stemming to
stop before the root word is found; second, the dictio-
nary is only part of the process and its improvement
addresses only some of the failure cases; and, last,
inclusion of new, rare words can cause matches with
incorrectly or overstemmed common words, leading
to decreases in performance for some cases while still
improving others. To test our other improvements,
we used only the original dictionary.
The fourth, fifth, and sixth rows show the effect of
including the algorithmic improvements we discussed
in the previous section. The results show the accu-
racy gains of including each improvement alone in the
original version, while the final row shows the additive
effect of including all three. Dealing with repeated
words improves the majority result by 1.1% and
the unique result by 1.9%. Adjustments to the rule
precedence improves the results by 0.5% and 0.6%
on the two collections, and adding additional affixes
improves results by 0.5% on majority and 0.7% on
unique. The combined effect of the three improve-
ments lowers the error rate to 1 in 19 words longer
than 5 characters, or an average of only 1
error every 38 words in the original Kompas collec-
tion. The overall outcome is highly effective Indone-
sian stemming using our modified nazief stemmer.
6 Conclusion
Stemming is an important Information Retrieval tech-
nique. In this paper, we have investigated Indonesian
stemming and, for the first time, presented an exper-
imental evaluation of Indonesian stemmers. Our re-
sults show that a successful stemmer is complex, and
requires the careful combination of several features:
support for complex morphological rules, progressive
stemming of words, dictionary checks after each step,
trial-and-error combinations of affixes, and recoding
support after prefix removal.
Our evaluation of stemmers was based on a user study.
Using four native speakers and a newswire collection,
we evaluated five automatic stemmers. Our results
show that the nazief stemmer is the most effec-
tive scheme, making less than 1 error in 21 words
on newswire text. With detailed analysis of failure
cases and modifications, we have improved this to less
than 1 error in 38 words. We conclude that the mod-
ified nazief stemmer is a highly effective tool.
We intend to continue this work. We will improve
the dictionaries by curating them to remove non-root
and add root words. We also plan to further extend
the nazief stemmer to deal with cases where the root
word is ambiguous.
Acknowledgments
We thank Bobby Nazief for providing source code and
the dictionary used in this paper, Vinsensius Berlian
Vega for source code, Riky Irawan for his Indonesian
corpus, and Gunarso for the Kamus Besar Bahasa
Indonesia (KBBI) dictionary. We also thank Wahyu
Wibowo for his help in answering our queries and Eric
Dharmazi, Agnes Julianto, Iman Suyoto, and Hendra
Yasuwito for their help in manually stemming our col-
lection. This work is supported by the Australian
Research Council.
References
Ahmad, F., Yusoff, M. & Sembok, T. M. T. (1996),
‘Experiments with a Stemming Algorithm for
Malay Words’, Journal of the American Society
for Information Science 47(12), 909–918.
Arifin, A. Z. & Setiono, A. N. (2002), Classification of
Event News Documents in Indonesian Language
Using Single Pass Clustering Algorithm, in ‘Pro-
ceedings of the Seminar on Intelligent Technol-
ogy and its Applications (SITIA)’, Teknik Elek-
tro, Sepuluh Nopember Institute of Technology,
Surabaya, Indonesia.
Bakar, Z. A. & Rahman, N. A. (2003), Evaluating the
effectiveness of thesaurus and stemming meth-
ods in retrieving Malay translated Al-Quran doc-
uments, in T. M. T. Sembok, H. B. Zaman,
H. Chen, S. R. Urs & S. Myaeng, eds, ‘Digital Li-
braries: Technology and Management of Indige-
nous Knowledge for Global Access’, Vol. 2911
of Lecture Notes in Computer Science, Springer-
Verlag, pp. 653–662.
Frakes, W. (1992), Stemming algorithms, in
W. Frakes & R. Baeza-Yates, eds, ‘Informa-
tion Retrieval: Data Structures and Algorithms’,
Prentice-Hall, chapter 8, pp. 131–160.
Gaustad, T. & Bouma, G. (2002), ‘Accurate stem-
ming of Dutch for text classification’, Language
and Computers 45(1), 104–117.
Idris, N. (2001), Automated Essay Grading System
Using Nearest Neighbour Technique in Infor-
mation Retrieval, Master’s thesis, University of
Malaya.
Lovins, J. (1968), ‘Development of a stemming al-
gorithm’, Mechanical Translation and Computational
Linguistics 11(1–2), 22–31.
Nazief, B. A. A. & Adriani, M. (1996), Confix-
stripping: Approach to Stemming Algorithm for
Bahasa Indonesia. Internal publication, Faculty
of Computer Science, University of Indonesia,
Depok, Jakarta.
Orăsan, C., Pekar, V. & Hasler, L. (2004), A compar-
ison of summarisation methods based on term
specificity estimation, in ‘Proceedings of the
Fourth International Conference on Language
Resources and Evaluation (LREC2004)’, Lisbon,
Portugal, pp. 1037–1041.
Porter, M. (1980), ‘An algorithm for suffix stripping’,
Program 14(3), 130–137.
Savoy, J. (1993), ‘Stemming of French words based on
grammatical categories’, Journal of the Ameri-
can Society for Information Science 44(1), 1–9.
Vega, V. B. (2001), Information Retrieval for the
Indonesian Language, Master’s thesis, National
University of Singapore.
Xu, J. & Croft, W. (1998), ‘Corpus-based stem-
ming using cooccurrence of word variants’, ACM
Transactions on Information Systems 16(1), 61–
81.