Stemming Indonesian
Jelita Asian Hugh E. Williams S.M.M. Tahaghoghi
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia.
{jelita,hugh,saied}@cs.rmit.edu.au
Abstract
Stemming words to (usually) remove suffixes has ap-
plications in text search, machine translation, docu-
ment summarisation, and text classification. For ex-
ample, English stemming reduces the words “com-
puter”, “computing”, “computation”, and “com-
putability” to their common morphological root,
“comput-”. In text search, this permits a search for
“computers” to find documents containing all words
with the stem “comput-”. In the Indonesian lan-
guage, stemming is of crucial importance: words
have prefixes, suffixes, infixes, and confixes that make
matching related words difficult. In this paper, we
investigate the performance of five Indonesian stem-
ming algorithms through a user study. Our results
show that, with the availability of a reasonable dictio-
nary, the unpublished algorithm of Nazief and Adri-
ani correctly stems around 93% of word occurrences
to the correct root word. With the improvements we
propose, this almost reaches 95%. We conclude that
stemming for Indonesian should be performed using
our modified Nazief and Adriani approach.
Keywords: stemming, Indonesian Information Re-
trieval
1 Introduction
Stemming is a core natural language processing
technique for efficient and effective Information Re-
trieval (Frakes 1992), and one that is widely accepted
by users. It is used to transform word variants to their
common root word by applying — in most cases —
morphological rules. For example, in text search, it
should permit a user searching using the query term
“stemming” to find documents that contain the terms
“stemmer” and “stems” because all share the com-
mon root word “stem”. It also has applications in
machine translation (Bakar & Rahman 2003), docu-
ment summarisation (Orăsan, Pekar & Hasler 2004),
and text classification (Gaustad & Bouma 2002).
For the English language, stemming is well-
understood, with techniques such as those of
Lovins (1968) and Porter (1980) in widespread use.
However, stemming for other languages is less well-
known: while there are several approaches available
for languages such as French (Savoy 1993), Span-
ish (Xu & Croft 1998), Malaysian (Ahmad, Yusoff
& Sembok 1996, Idris 2001), and Indonesian (Arifin
& Setiono 2002, Nazief & Adriani 1996, Vega 2001), there is almost no consensus about their effectiveness. Indeed, for Indonesian the schemes are neither easily accessible nor well-explored. There are no comparative studies that consider the relative effectiveness of alternative stemming approaches for this language.

Copyright © 2005, Australian Computer Society, Inc. This paper appeared at the 28th Australasian Computer Science Conference (ACSC2005), The University of Newcastle, Australia. Conferences in Research and Practice in Information Technology, Vol. 38. V. Estivill-Castro, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
Stemming is essential to support effective Indone-
sian Information Retrieval, and has uses as diverse
as defence intelligence applications, document trans-
lation, and web search. Unlike English, a more com-
plex class of affixes — which includes prefixes, suf-
fixes, infixes (insertions), and confixes (combinations
of prefixes and suffixes) — must be removed to trans-
form a word to its root word, and the application
and order of the rules used to perform this process
requires careful consideration. Consider a simple ex-
ample: the word “minuman” (meaning “a drink”) has
the root “minum” (“to drink”) and the suffix “-an”.
However, many examples do not share the simple suf-
fix approach used by English:
• “pemerintah” (meaning “government”) is derived from the root “perintah” (meaning “govern”) through the process of inserting the infix “em” between the “p-” and “-erintah” of “perintah”.

• “anaknya” (a possessive form of child, such as “his/her child”) has the root “anak” (“child”) and the suffix “-nya” (a third person possessive).

• “duduklah” (please sit) is “duduk” (“sit”) and “-lah” (a softening, equivalent to “please”).

• “buku-buku” (books) is the plural of “buku” (“book”).
These latter examples illustrate the importance of
stemming in Indonesian: without them, for example,
a document containing “anaknya” (“his/her child”)
does not match the term “anak” (“child”).
Several techniques have been proposed for stem-
ming Indonesian. We evaluate these techniques
through a user study, where we compare the performance of each scheme to the results of manual stemming by four native speakers. Our results show
that an existing technique, proposed by Nazief and
Adriani (1996) in an unpublished technical report,
correctly stems around 93% of all word occurrences
(or 92% of unique words). After classifying the fail-
ure cases, and adding our own rules to address these
limitations, we show this can be improved to 95%
for both unique and all word occurrences. We believe
that adding a more complete dictionary of root words
would improve these results even further. We con-
clude that our modified Nazief and Adriani stemmer
should be used in practice for stemming Indonesian.
2 Stemming Techniques
In this section, we describe the five schemes we have
evaluated for Indonesian stemming. In particular, we
detail the approach of Nazief and Adriani, which per-
forms the best in our evaluation of all approaches in
Section 4. We propose extensions to this approach in
Section 5.
2.1 Nazief and Adriani’s Algorithm
The stemming scheme of Nazief and Adriani is de-
scribed in an unpublished technical report from the
University of Indonesia (1996). In this section, we de-
scribe the steps of the algorithm, and illustrate each
with examples; however, for compactness, we omit
the detail of selected rule tables. We refer to this
approach as nazief.
The algorithm is based on comprehensive morpho-
logical rules that group together and encapsulate al-
lowed and disallowed affixes, including prefixes, suf-
fixes, infixes (insertions) and confixes (combination
of prefixes and suffixes). The algorithm also supports
recoding, an approach to restore an initial letter that
was removed from a root word prior to prepending
a prefix. In addition, the algorithm makes use of an
auxiliary dictionary of root words that is used in most
steps to check if the stemming has arrived at a root
word.
Before considering how the scheme works, we con-
sider the basic groupings of affixes used as a basis
for the approach, and how these definitions are com-
bined to form a framework to implement the rules.
The scheme groups affixes into categories:
1. Inflection suffixes — the set of suffixes that do not alter the root word. For example,
“duduk” (sit) may be suffixed with “-lah” to give
“duduklah” (please sit). The inflections are fur-
ther divided into:
(a) Particles (P) including “-lah” and
“-kah”, as used in words such as “duduklah”
(please sit)
(b) Possessive pronouns (PP) including
“-ku”, “-mu”, and “-nya”, as used in
“ibunya” (a third person possessive form of
“mother”)
Particle and possessive pronoun inflections can
appear together and, if they do, possessive pro-
nouns appear before particles. A word can have
at most one particle and one possessive pronoun,
and these may be applied directly to root words
or to words that have a derivation suffix. For
example, “makan” (to eat) may be appended
with derivation suffix “-an” to give “makanan”
(food). This can be suffixed with “-nya” to give
“makanannya” (a possessive form of “food”)
2. Derivation suffixes — the set of suffixes that are
directly applied to root words. There can be only
one derivation suffix per word. For example, the
word “lapor” (to report) can be suffixed by the
derivation suffix “-kan” to become “laporkan”
(go to report). In turn, this can be suffixed with,
for example, an inflection suffix “-lah” to become
“laporkanlah” (please go to report)
3. Derivation prefixes — the set of prefixes that are
applied either directly to root words, or to words
that have up to two other derivation prefixes.
For example, the derivation prefixes “mem-” and
“per-” may be prepended to “indahkannya” to
give “memperindahkannya” (the act of beautify-
ing)
The classification of affixes as inflections and
derivations leads to an order of use:
[DP+[DP+[DP+]]] root-word [[+DS][+PP][+P]]
The square brackets indicate that an affix is optional.

Prefix    Disallowed suffixes
be-       -i
di-       -an
ke-       -i, -kan
me-       -an
se-       -i, -kan
te-       -an

Table 1: Disallowed prefix and suffix combinations. The only exception is that the root word “tahu” is permitted with the prefix “ke-” and the suffix “-i”.
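To make the combinations in Table 1 concrete, the short Python sketch below (our own illustration, not code from the stemmer) encodes the disallowed prefix and suffix pairs together with the single “ketahui” exception noted in the caption; the function name combination_disallowed is ours.

```python
# Disallowed prefix-suffix combinations from Table 1, with the "ketahui" exception.
DISALLOWED = {
    "be": {"i"},
    "di": {"an"},
    "ke": {"i", "kan"},
    "me": {"an"},
    "se": {"i", "kan"},
    "te": {"an"},
}

def combination_disallowed(prefix, suffix, root_word=None):
    if prefix == "ke" and suffix == "i" and root_word == "tahu":
        return False                      # the one permitted exception
    return suffix in DISALLOWED.get(prefix, set())

print(combination_disallowed("di", "an"))            # True: "di-...-an" is rejected
print(combination_disallowed("ke", "i", "tahu"))     # False: "ketahui" is allowed
```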
The previous definition forms the basis of the rules
used in the approach. However, there are exceptions
and limitations that are incorporated in the rules:
1. Not all combinations are possible. For example,
after a word is prefixed with “di-”, the word is
not allowed to be suffixed with “-an”. A com-
plete list is shown in Table 1
2. The same affix cannot be repeatedly applied. For
example, after a word is prefixed with “te-” or
one of its variations, it is not possible to repeat
the prefix “te-” or any of those variations
3. If a word has one or two characters, then stem-
ming is not attempted.
4. Adding a prefix may change the root word or
a previously-applied prefix; we discuss this fur-
ther in our description of the rules. To illustrate, consider “meng-”, which has the variations “mem-”, “meng-”, “meny-”, and “men-”. Some of
these may change the prefix of a word, for exam-
ple, for the root word “sapu” (broom), the vari-
ation applied is “meny-” to produce the word
“menyapu” (to sweep) in which the “s” is re-
moved
The latter complication requires that an effective In-
donesian stemming algorithm be able to add deleted
letters through the recoding process.
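As a concrete illustration of recoding, the following minimal Python sketch restores the elided initial letter for the nasal prefix variants and checks the result against a toy dictionary. The prefix-to-letter map is a common analysis of Indonesian morphophonemics and is our own assumption; it is not the exact rule table used by the algorithm.

```python
# Toy recoding sketch: restore the letter elided by a nasal prefix variant.
ROOT_WORDS = {"sapu", "tulis", "pilih", "karang"}
RESTORED_LETTER = {"meny": "s", "men": "t", "mem": "p", "meng": "k"}   # assumed mapping

def recode(word):
    """Try each nasal prefix variant; return the recoded root if it is in the dictionary."""
    for prefix, letter in RESTORED_LETTER.items():
        if word.startswith(prefix):
            candidate = letter + word[len(prefix):]
            if candidate in ROOT_WORDS:
                return candidate
    return None

print(recode("menyapu"))    # -> "sapu": the elided "s" is restored
```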
The algorithm itself employs three components:
the affix groupings, the order of use rules (and their
exceptions), and a dictionary. The dictionary is
checked after any stemming rule succeeds: if the resul-
tant word is found in the dictionary, then stemming
has succeeded in finding a root word, the algorithm
returns the dictionary word, and then stops; we omit
this lookup from each step in our rule listing. In ad-
dition, each step checks if the resultant word is less
than two characters in length and, if so, no further
stemming is attempted.
For each word to be stemmed, the following steps
are followed:
1. The unstemmed word is searched for in the dic-
tionary. If it is found in the dictionary, it is as-
sumed the word is a root word, and so the word
is returned and the algorithm stops.
2. Inflection suffixes (“-lah”, “-kah”, “-ku”, “-mu”,
or “-nya”) are removed. If this succeeds and
the suffix is a particle (“-lah” or “-kah”), this
step is again attempted to remove any inflec-
tional possessive pronoun suffixes (“-ku”, “-mu”,
or “-nya”).
3. Derivation suffix (“-i” or “-an”) removal is at-
tempted. If this succeeds, Step 4 is attempted.
If Step 4 does not succeed:
(a) If “-an” was removed, and the final letter
of the word is “-k”, then the “-k” is also
removed and Step 4 is re-attempted. If that
fails, Step 3b is performed.
                         Following characters
Set 1                   Set 2                    Set 3        Set 4        Prefix type
“-r-”                   “-r-”                    –            –            none
“-r-”                   vowel                    –            –            ter-luluh
“-r-”                   not (“-r-” or vowel)     “-er-”       vowel        ter
“-r-”                   not (“-r-” or vowel)     “-er-”       not vowel    none
“-r-”                   not (“-r-” or vowel)     not “-er-”   –            ter
not (vowel or “-r-”)    “-er-”                   vowel        –            none
not (vowel or “-r-”)    “-er-”                   not vowel    –            te

Table 2: Determining the prefix type for words prefixed with “te-”. If the prefix “te-” does not match one of the rules in the table, then “none” is returned. Similar rules are used for “be-”, “me-”, and “pe-”.
(b) The removed suffix (“-i”, “-an”, or “-kan”)
is restored.
4. Derivation prefix removal is attempted. This has
several sub-steps:
(a) If a suffix was removed in Step 3, then disal-
lowed prefix-suffix combinations are checked
using the list in Table 1. If a match is found,
then the algorithm returns.
(b) If the current prefix matches any previous
prefix, then the algorithm returns.
(c) If three prefixes have previously been re-
moved, the algorithm returns.
(d) The prefix type is determined by one of the
following steps:
i. If the prefix of the word is “di-”, “ke-”,
or “se-”, then the prefix type is “di”,
“ke”, or “se” respectively.
ii. If the prefix is “te-”, “be-”, “me-”, or
“pe-”, then an additional process of ex-
tracting character sets to determine the
prefix type is required. As an example,
the rules for the prefix “te-” are shown
in Table 2. Suppose that the word be-
ing stemmed is “terlambat” (late). Af-
ter removing “te-” to give “-rlambat”, the first set of characters is extracted from the remainder according to the “Set 1” rules. In this case, the letter following
the prefix “te-” is “r”, and this matches
the first five rows of the table. Follow-
ing “-r-” is “-l-” (Set 2), and so the
third to fifth rows match. Following
“-l-” is “-ambat”, eliminating the third
and fourth rows for Set 3 and deter-
mining that the prefix type is “ter-” as
shown in the rightmost column.
iii. If the first two characters do not match
“di-”, “ke-”, “se-”, “te-”, “be-”, “me-”,
or “pe-” then the algorithm returns.
(e) If the prefix type is “none”, then the al-
gorithm returns. If the prefix type is not
“none”, then the prefix type is looked up in Table 3, the corresponding prefix to be removed is found, and that prefix is removed from the word; for
compactness, Table 3 shows only the simple
cases and those matching Table 2.
(f) If the root word has not been found, Step 4
is recursively attempted for further prefix
removal. If a root word is found, the algo-
rithm returns.
(g) Recoding is performed. This step depends on the prefix type, and can result in different prefixes being prepended to the stemmed word and checked in the dictionary. For compactness, we consider only the case of the prefix type “ter-luluh” shown in Tables 2 and 3. In this case, after removing “ter-”, an “r-” is prepended to the word. If this new word is not in the dictionary, Step 4 is repeated for the new word. If a root word is not found, then “r-” is removed and “ter-” restored, the prefix is set to “none”, and the algorithm returns.

Prefix type    Prefix to be removed
di             di-
ke             ke-
se             se-
te             te-
ter            ter-
ter-luluh      ter-

Table 3: Determining the prefix from the prefix type. Only simple entries, and those for the te- prefix type, are shown.
5. Having completed all steps unsuccessfully, the al-
gorithm returns the original word.
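The following Python sketch illustrates the control flow of the steps above: a dictionary check before and after each removal, inflection suffixes before derivation suffixes, and at most three derivation prefixes. It is a simplified illustration only; the full algorithm also applies the prefix-type tables (Tables 2 and 3), recoding, and the disallowed-combination checks of Table 1, all of which are elided here, and the toy dictionary and affix lists are our own.

```python
ROOT_WORDS = {"duduk", "makan", "lapor", "ajar", "minum"}   # toy dictionary

PARTICLES = ("lah", "kah")                     # inflection suffixes: particles
POSSESSIVES = ("ku", "mu", "nya")              # inflection suffixes: possessive pronouns
DERIVATION_SUFFIXES = ("kan", "an", "i")       # derivation suffixes
DERIVATION_PREFIXES = ("di", "ke", "se", "te", "ter", "be", "ber",
                       "me", "mem", "men", "pe", "per")

def in_dictionary(word):
    return word in ROOT_WORDS

def strip_suffix(word, suffixes):
    """Remove the first matching suffix, if the remainder stays longer than two characters."""
    for s in suffixes:
        if word.endswith(s) and len(word) - len(s) > 2:
            return word[: -len(s)]
    return word

def stem(word):
    if len(word) <= 2 or in_dictionary(word):              # step 1, plus the short-word rule
        return word
    current = strip_suffix(word, PARTICLES)                 # step 2: particles, then
    current = strip_suffix(current, POSSESSIVES)            # possessive pronouns
    if in_dictionary(current):
        return current
    current = strip_suffix(current, DERIVATION_SUFFIXES)    # step 3: a derivation suffix
    if in_dictionary(current):
        return current
    for _ in range(3):                                      # step 4: at most three prefixes
        for p in sorted(DERIVATION_PREFIXES, key=len, reverse=True):
            if current.startswith(p) and len(current) - len(p) > 2:
                candidate = current[len(p):]
                if in_dictionary(candidate):
                    return candidate
                current = candidate
                break
        else:
            break
    return word                                             # step 5: return the original word

if __name__ == "__main__":
    for w in ("duduklah", "makanannya", "laporkanlah", "minuman"):
        print(w, "->", stem(w))
```

In the full algorithm, each removal is additionally validated against the disallowed combinations of Table 1 and may trigger the prefix-type and recoding rules of Tables 2 and 3 before the dictionary is consulted.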
2.2 Arifin and Setiono’s Algorithm
Arifin and Setiono (2002) propose a less complex
scheme than that of Nazief and Adriani, but one that
follows a similar approach of using a dictionary, pro-
gressively removing affixes, and dealing with recod-
ing. We refer to this approach as arifin. In sum-
mary, their approach attempts to remove all prefixes
and then all suffixes, stopping when the stemmed
word is found in a dictionary or the number of af-
fixes removed reaches a maximum of two prefixes and
three suffixes. If the stemmed word cannot be found
after prefix and suffix removal is completed, affixes
are restored to the word in all possible combinations
so that possible stemming errors are minimised.
The particular advantage of this approach is that,
if the word cannot be found after the removal of
all affixes, the algorithm then tries all combinations
of removed affixes. This process helps avoid overstemming when parts of the word can be mistaken for prefixes or suffixes. For example, the
word “diselamatkan” (to be saved) has the root word
“selamat” (safe). After the first step of removing the
prefix “di-”, the remaining word is “selamatkan” (to
save). Since the word is not found in the dictionary,
the algorithm removes “se-” (another valid prefix) to
give “lamatkan”; however, in this case, this is a mis-
take because “se” is not a prefix, but rather part of
the root word. With prefix removal complete, the
next step removes the suffix “-kan” to give “lamat”,
which is not a valid root word. The scheme then tries
combinations of restoring prefixes and suffixes and,
after restoring the prefix “se-”, creates “selamat” and
the result is a successful dictionary lookup.
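A minimal Python sketch of this prefix-first, suffix-second strategy with affix restoration is shown below; the dictionary, affix lists, and restoration order are illustrative assumptions of ours rather than the published rule set.

```python
# Sketch: strip up to two prefixes and three suffixes, then restore combinations.
ROOT_WORDS = {"selamat", "beri", "ajar"}
PREFIXES = ("di", "se", "ke", "ter", "ber", "mem", "pe")
SUFFIXES = ("kan", "an", "i", "nya", "lah")

def remove_affixes(word):
    """Greedily strip prefixes then suffixes, recording what was removed."""
    removed_prefixes, removed_suffixes = [], []
    for _ in range(2):                                   # at most two prefixes
        for p in PREFIXES:
            if word.startswith(p) and word not in ROOT_WORDS:
                removed_prefixes.append(p)
                word = word[len(p):]
                break
    for _ in range(3):                                   # at most three suffixes
        for s in SUFFIXES:
            if word.endswith(s) and word not in ROOT_WORDS:
                removed_suffixes.append(s)
                word = word[: -len(s)]
                break
    return word, removed_prefixes, removed_suffixes

def stem(word):
    if word in ROOT_WORDS:
        return word
    stripped, pres, sufs = remove_affixes(word)
    if stripped in ROOT_WORDS:
        return stripped
    # Restoration: re-attach subsets of the removed affixes (innermost first) and
    # accept the first combination found in the dictionary; this is what recovers
    # "selamat" after "se-" has been stripped too eagerly from "diselamatkan".
    for i in range(len(pres), -1, -1):                   # restore prefixes pres[i:]
        for j in range(len(sufs), -1, -1):               # restore suffixes sufs[j:]
            candidate = "".join(pres[i:]) + stripped + "".join(reversed(sufs[j:]))
            if candidate in ROOT_WORDS:
                return candidate
    return word

print(stem("diselamatkan"))   # -> "selamat"
```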
The disadvantages of the scheme are two-fold.
First, repeated prefixes and suffixes are allowed; for
example, the word “peranan” (role, part) would be
overstemmed to “per” (spring) instead of the cor-
rect root word “peran” (to play the role of). Sec-
ond, the order of affix removal — prefix and then
suffix — can lead to incorrect stemming; for exam-
ple, the word “memberikan” (“to give away” in active
verb form) has the root word “beri” (to give away), but
the algorithm removes “mem-” and then “ber-” (both
valid prefixes) to give “ikan” (fish). The algorithm of
Nazief and Adriani does not share these problems.
2.3 Vega’s Algorithm
The approach of Vega (2001) is distinctly different
because it does not use a dictionary. As we show
later, this reduces its accuracy compared to other ap-
proaches, mostly because it requires that the rules be
correct and complete, and it prevents ad hoc restora-
tion of combinations of affixes and exploring whether
these match root words. In addition, the approach
does not apply recoding.
In brief, the approach works as follows. For each
word to be stemmed, rules are applied that attempt
to segment the word into smaller components. For ex-
ample, the word “didudukkan” (to be seated) might
be considered against the following rule: (di) + stem(root) + (i | kan). This checks whether the word begins with “di-” and ends with “-i” or “-kan”.
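As a rough illustration of this pattern-based segmentation, the sketch below expresses two such rules as regular expressions and returns the captured stem of the first rule that matches. The rule set shown is a small illustrative subset of our own devising, not the complete rule base of the algorithm.

```python
import re

RULES = [
    re.compile(r"^di(?P<stem>.+?)(?:i|kan)$"),          # (di) + stem + (i | kan)
    re.compile(r"^(?:ber|ke)(?P<stem>.+?)(?:an)?$"),    # (ber | ke) + stem + optional -an
]

def stem(word):
    """Return the stem captured by the first matching rule, or the word itself."""
    for rule in RULES:
        match = rule.match(word)
        if match:
            return match.group("stem")
    return word

print(stem("didudukkan"))   # -> "duduk" under the first template
```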
There are four variants of this algorithm: stan-
dard, extended, iterative standard, and iterative ex-
tended. Standard deals with standard affix removal of
prefixes such as “ber-”, “di-”, and “ke-”, the suffixes
“-i”, “-an”, and “-nya”, and the infixes “-el-”, “-em-”,
“-er-”. In contrast, extended — unlike all other ap-
proaches described in this paper — deals with non-
standard affixes used in informal spoken Indonesian.
The iterative versions recursively stem words. We report results for only the first scheme, which we refer to as vega1; the other variants are ineffective, performing between 10% and 25% worse than vega1.
2.4 Ahmad, Yusoff, and Sembok’s Algorithm
This approach (Ahmad et al. 1996) has two distinct
differences to the others: first, it was developed for
the closely-related Malaysian language, rather than
Indonesian; and, second, it does not progressively ap-
ply rules (we explain this next). We have addressed
the former issue by replacing the Malaysian dictio-
nary used by the authors with the Indonesian dictio-
naries discussed later. In practice, however, we could
have done more to adapt the scheme to Indonesian:
the sets of affixes are different between Indonesian
and Malaysian, some rules are not applied in Indone-
sian, and some rules applicable to Indonesian are not
used in Malaysian. We, therefore, believe that the re-
sults we report represent a baseline performance for
this scheme, but it is unclear how much improvement is
possible with additional work.
The algorithm itself is straightforward. An or-
dered list of all valid prefixes, suffixes, infixes, and
confixes is maintained. Before it begins, the algo-
rithm searches for the word in the dictionary and
simply returns this original word if the lookup succeeds. If the word is not in the dictionary, then the
next rule in the rule list is applied to the word. If the
rule succeeds — for example, the word begins with
“me-” and ends with “-kan” — then the affixes are
removed and the stemmed word checked against the
dictionary. If it is found, then the stemmed word is
returned; if it is not found, then the stemmed word is
discarded and the next rule is applied to the original
word. If all rules fail to discover a stemmed word,
then the original word is returned. The advantage of
not progressively applying rules is that overstemming
is minimised. In addition, similarly to other success-
ful approaches, the scheme supports recoding.
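The following Python sketch illustrates the ordered, non-progressive rule list just described: each rule names a complete affix pattern, is applied to the original word, and its result is accepted only if the dictionary confirms it. The rules and dictionary are toy placeholders, not the authors' actual tables.

```python
ROOT_WORDS = {"asal", "ajar", "beri"}

# Each rule is (prefix, suffix); an empty string means "no affix on that side".
RULES = [
    ("ber", ""), ("me", "kan"), ("di", "kan"), ("", "an"), ("", "i"),
]

def apply_rule(word, prefix, suffix):
    """Return the word with the rule's affixes removed, or None if the rule does not apply."""
    if prefix and not word.startswith(prefix):
        return None
    if suffix and not word.endswith(suffix):
        return None
    end = -len(suffix) if suffix else len(word)
    return word[len(prefix):end]

def stem(word):
    if word in ROOT_WORDS:
        return word
    for prefix, suffix in RULES:
        candidate = apply_rule(word, prefix, suffix)
        if candidate and candidate in ROOT_WORDS:
            return candidate          # accept only if the dictionary confirms it
        # otherwise discard the candidate and try the next rule on the original word
    return word

print(stem("berasal"))   # -> "asal": the "ber-" rule fires before any infix rule
```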
Because the scheme is not progressive, its accu-
racy depends closely on the order of the rules. For
example, if infix rules are listed before prefix rules,
“berasal” (to come from) — for which the correct
stem is “asal” (origin or source) — is stemmed to
“basal” (basalt or dropsy) by removing the infix “-er-”
before the prefix “ber-”. Ahmad et al. have exper-
imented with several variations of the order of the
rules to explore the false positives generated by each
combination.
We report results with only one rule ordering —
which we refer to as ahmad2a — as we have found
that this is the most effective of the seven orderings,
and there is little difference between the variants; the
other schemes either perform the same — in the case
of ahmad2b — or at most 1% worse, in the case of
ahmad1.
Idris (2001) has further extended the scheme of
Ahmad et al. In the extension, a different recoding
scheme and progressive stemming is applied. Similar
to the basic approach, the algorithm checks the dic-
tionary after each step, stopping and returning the
stemmed word if it is found. The scheme works as
follows: first, after checking for the word in the dic-
tionary, the algorithm tests if the prefix of the word
matches a prefix that may require recoding; second, if
recoding is required, it is performed and the resultant
word searched for in the dictionary, while if recoding
is not required, the prefix is removed as usual and
the word checked in the dictionary; third, having re-
moved a prefix, suffix removal is attempted and, if it
succeeds, the word is checked in the dictionary; and,
last, the algorithm returns to the second step with
the partially stemmed word.
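A compact sketch of this progressive variant is given below: in each round, recoding is tried for the prefixes that may require it before plain prefix removal, followed by suffix removal, with a dictionary check after every change. The recoding map, affix lists, and the three-round limit are our own simplifying assumptions.

```python
ROOT_WORDS = {"sapu", "tulis", "ajar"}
RECODING = {"meny": "s", "men": "t", "mem": "p", "meng": "k"}   # assumed prefix -> restored letter
PLAIN_PREFIXES = ("di", "ke", "se", "ber", "ter")
SUFFIXES = ("kan", "an", "i", "nya", "lah")

def stem(word):
    original = word
    if word in ROOT_WORDS:
        return word
    for _ in range(3):                                # a few progressive rounds
        for prefix, letter in RECODING.items():       # recoding is tried before plain removal
            if word.startswith(prefix):
                candidate = letter + word[len(prefix):]
                if candidate in ROOT_WORDS:
                    return candidate
        for prefix in PLAIN_PREFIXES + tuple(RECODING):
            if word.startswith(prefix):               # otherwise strip the prefix as usual
                word = word[len(prefix):]
                break
        if word in ROOT_WORDS:
            return word
        for suffix in SUFFIXES:                       # then attempt suffix removal
            if word.endswith(suffix):
                word = word[: -len(suffix)]
                break
        if word in ROOT_WORDS:
            return word
    return original                                   # nothing found: return the input

print(stem("menyapu"))    # -> "sapu", via recoding of the elided "s"
```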
There are two variants of this algorithm: the first
changes prefixes and then performs recoding, while
the second does the reverse. We report results for only the second scheme, which we refer to as idris2; the other variant performs only 0.3% worse under all measures.
3 Experiments
To investigate the performance of stemming schemes,
we have carried out a large user experiment. In this,
we compared the results of stemming with each of the
algorithms to manual stemming by native Indonesian
speakers. This section explains the collection we used and the experimental design, and presents our results.
3.1 Collection
The collection of words to be stemmed was taken from news stories at the Kompas online newspaper (see http://www.kompas.com/).
In all, we obtained 9,901 documents. From these, we
extracted every fifth word occurrence and kept those
words that were longer than five characters in length;
our rationale for the minimum length is that shorter
words rarely require stemming, that is, they are al-
ready root words. Using this method, our word col-
lection contained 3,986 non-unique words and 1,807
unique words.
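A small sketch of this sampling step is shown below; the tokenisation (lowercasing and splitting on non-letter characters) and the function name sample_words are our assumptions, since the paper does not specify these details.

```python
import re

def sample_words(documents, step=5, min_length=6):
    """Return the sampled (non-unique) word list from an iterable of document strings."""
    occurrences = []
    for doc in documents:
        occurrences.extend(re.findall(r"[a-zA-Z-]+", doc.lower()))
    sampled = occurrences[::step]                    # every fifth word occurrence
    return [w for w in sampled if len(w) >= min_length]   # keep words longer than five characters

docs = ["Pemerintah mengatakan bahwa pembangunan jalan itu akan diselesaikan tahun depan"]
print(sample_words(docs))                            # -> ['pemerintah'] for this toy input
```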
We chose to extract non-unique words to reflect
the real-world stemming problem encountered in text
search, document summarisation, and translation.
The frequency of word occurrences is highly skewed: in English, for example, “the” appears about twice as often as the next most common word; a similar phenomenon exists in Indonesian, where “yang” (a relative pronoun that is similar to “who”, “which”, or “that”, or “the” if used with an adjective) is the most common word. Given this skew, it is crucial that common words are correctly stemmed, but less so that rare words are.

       B       C       D
A    3,674   3,689   3,564
B      –     3,588   3,555
C      –       –     3,528

Table 4: Results of manual stemming by four Indonesian native speakers, denoted as A to D. The values shown are the number of cases out of 3,986 where participants agree.
We use the collection in two ways. First, we inves-
tigate the error rate of stemming algorithms relative
to manual stemming for the non-unique word collec-
tion. This permits quantifying the overall error rate
of a stemmer for a collection of real-world documents,
that is, it allows us to discover the total errors made.
Second, we investigate the error rate of stemming for
unique words only. This allows us to investigate how
many different errors each scheme makes, that is, the
total number of unique errors. Together, these allow
effective assessment of stemming accuracy.
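To make the two measures concrete, the sketch below computes accuracy over all word occurrences and over unique words. For simplicity it assumes a single agreed stem per word (in the actual study, a few words received different majority decisions in different occurrences), and the names are our own.

```python
def evaluate(stemmer, occurrences, gold):
    """occurrences: list of word occurrences; gold: dict mapping word -> agreed root word."""
    occ_errors = sum(1 for w in occurrences if stemmer(w) != gold[w])
    unique_words = set(occurrences)
    unique_errors = sum(1 for w in unique_words if stemmer(w) != gold[w])
    return {
        "occurrence_accuracy": 1 - occ_errors / len(occurrences),
        "unique_accuracy": 1 - unique_errors / len(unique_words),
    }

# Example with a trivial identity "stemmer":
gold = {"minuman": "minum", "makanan": "makan", "duduk": "duduk"}
occurrences = ["minuman", "makanan", "minuman", "duduk"]
print(evaluate(lambda w: w, occurrences, gold))
```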
3.2 Baselines
We asked four native Indonesian speakers to manually
stem each of the 3,986 words. The words were listed
in their order of occurrence, that is, repeated words
were distributed across the collection and words were
not grouped by prefix. Table 4 shows the results:
as expected, there is no consensus as to the root
words between the speakers and, indeed, the agree-
ment ranges from around 93% (for speakers A and
C) to less than 89% (for C and D). For example, the
word “bagian” (part) is left unstemmed in some cases
and stemmed to “bagi” (divide) in others, and simi-
larly “adalah” (to be) is sometimes stemmed to “ada”
(exists) and sometimes left unchanged. Indeed, the
latter example illustrates another problem: in some
cases, a speaker was inconsistent, on some occasions
stemming “adalah” to “ada”, and on others leaving
it unchanged.
Having established that native speakers disagree
and also make errors, we decided to use the major-
ity decision as the correct answer. Table 5 shows the
number of cases where three and four speakers agree.
All four speakers agree on only 82.6% of word occurrences while, at the other extreme, speakers B, C,
and D agree on 84.3%. The number of cases where
any three or all four speakers agree (shown as “Any
three”) is 95.3%. We use this latter case as our first
baseline to compare to automatic stemming: if a ma-
jority agree then we keep the original word in our
collection and note its answer as the majority deci-
sion; words that do not have a majority stemming
decision are omitted from the collection. We refer to
this baseline collection of 3,799 words as majority.
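The majority baseline can be constructed along the following lines; the data structures and the function name build_majority_baseline are our own illustration of the procedure just described.

```python
from collections import Counter

def build_majority_baseline(word_list, manual_stems):
    """manual_stems: list of four lists, manual_stems[k][i] = speaker k's stem of word i."""
    baseline = []
    for i, word in enumerate(word_list):
        votes = Counter(stems[i] for stems in manual_stems)
        stem, count = votes.most_common(1)[0]
        if count >= 3:                       # any three (or all four) speakers agree
            baseline.append((word, stem))
    return baseline

words = ["duduklah", "adalah"]
speakers = [["duduk", "adalah"], ["duduk", "ada"], ["duduk", "ada"], ["duduk", "ada"]]
print(build_majority_baseline(words, speakers))
# -> [('duduklah', 'duduk'), ('adalah', 'ada')]
```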
Interestingly, even the majority make errors. How-
ever, this is rare and so we accept the majority deci-
sion. For example, for “penebangan” (felling; cutting
down; chopping down), the correct stem is “tebang”
(to cut down; to chop down). The majority misread
the word as the more common “penerbangan” (flight;
flying), and stemmed it to “terbang” (to fly).
The correct stems are sometimes ambiguous; for example, the suffix “-kan” can be removed from “gerakan” (movement) to give “gera” (to frighten or threaten), or the suffix “-an” can be removed to give “gerak” (to move); both are correct. We found that all four human subjects stemmed this word to “gerak”.

ABCD    ABC     ABD     ACD     BCD     Any three
3,292   3,493   3,413   3,408   3,361   3,799

Table 5: Consensus and majority agreement for manual stemming by four Indonesian native speakers, denoted as A to D. The values shown are the number of cases out of 3,986 where participants agree.
There are 1,751 unique unstemmed words used to
create the majority collection. However, after stem-
ming, the unique stemmed word collection that is
agreed by the majority has 1,753 words. This increase
of 2 words is due to cases such as “adalah” remaining
unstemmed by 3 out of 4 speakers in some cases and
being stemmed by 3 out of 4 to “ada” in other cases.
We refer to the collection of unique stemmed words
that are agreed by the majority as unique, and we
use this as our second baseline for automatic stem-
ming.
We have also investigated the performance of
automatic stemming schemes when the complete
collection is used and they stem to any of the manual
stemming results. For example, consider a case where
the word “spiritual” (spiritual) is stemmed by two
speakers to “spiritual”, by a third to “spirit” (spirit),
and the fourth to “ritual” (ritual). In this case, if an
automatic approach stems to any of the three manual stems, we deem it to have correctly stemmed the word.
Not surprisingly, all automatic schemes perform
better under this measure. However, the results do
not show any change in relative performance between
the schemes and we omit the results from this paper
for compactness.
4 Results
Table 6 shows the results of our experiments using
the majority and unique collections. The nazief
scheme works best: it correctly stems 93% of word oc-
currences and 92% of unique words, making less than
two-thirds of the errors of the second-ranked scheme,
ahmad2a. The remaining dictionary schemes — ahmad2a, idris2, and arifin — are comparable and
achieve 88%–89% on both collections. The only non-
dictionary scheme, vega1, makes almost five times as
many errors on the majority collection as nazief,
illustrating the importance of validating decisions us-
ing an external word source.
Interestingly, the idris2 approach offers no im-
provement to the ahmad scheme on which it is based.
On the unique collection, idris2 is 0.5% or eight
words better than ahmad2a. However, on the ma-
jority collection, idris2 is 0.9% or 34 words worse
than ahmad2a. This illustrates an important char-
acteristic of our experiments: stemming algorithms
should be considered in the context of word occur-
rences and not unique words. While idris2 makes fewer errors on rare words, it makes more errors on
more common words, and is less effective overall for
stemming document collections.
The performance of the nazief scheme is impres-
sive and, for this reason, we focus on it in the remain-
der of this paper. Under the strict majority model
— where only one answer is allowed — the scheme
incorrectly stems fewer than 1 in 13 words longer than 5 characters; in practice, when short words are included, this is an error rate of less than 1 in 21 word occurrences.

Stemmer     majority                       unique
            Correct (%)  Errors (words)    Correct (%)  Errors (words)
nazief      92.8         272               92.1         139
ahmad2a     88.8         424               88.3         205
idris2      87.9         458               88.8         197
arifin      87.7         466               88.0         211
vega1       66.3         1,280             69.4         536

Table 6: Automatic stemming performance compared to the majority and unique baseline collections.

However, there is still scope for
improvement: even under a model where all 3,986
word occurrences are included and any answer pro-
vided by a native speaker is deemed correct, the al-
gorithm achieves only 95%. Considering both cases,
therefore, there is scope for an at least 5% improve-
ment in performance by eliminating failure cases and
seeking to make better decisions in non-majority de-
cision cases. We consider and propose improvements
in the next section.
5 Improving the Nazief and Adriani Stem-
mer
In this section, we discuss the reasons why the nazief
scheme works well, and what aspects of it can be im-
proved. We present a detailed analysis of the failure
cases, and propose solutions to these problems. We
then present the results of including these improve-
ments, and describe our modified nazief approach.
5.1 Discussion
The performance of the nazief approach is perhaps
unsurprising: it is by far the most complex approach,
being based closely on the detailed morphological
rules of the Indonesian language. In addition, it sup-
ports dictionary lookups and progressive stemming,
allowing it to evaluate each step to test if a root word
has been found and to recover from errors by restor-
ing affixes to attempt different combinations. How-
ever, despite these features, the algorithm can still be
improved.
In Table 7, we have classified the failures made
by the nazief scheme on the majority collection.
The two most significant faults are dictionary related:
around 33% of errors are the result of non-root words
being in the dictionary, and around 11% are the result
of root words not being in the dictionary. Hyphen-
ated words — which we discuss in more detail in the
next section — contribute 15.8% of the errors. Of
the remaining errors, around 49 errors or 18% are re-
lated to rules and rule precedence. The remaining
errors are foreign words, misspellings, acronyms, and
proper nouns.
In summary, three opportunities exist to improve
stemming with nazief. First, a more complete and
accurate root word dictionary may reduce errors. Sec-
ond, features can be added to support stemming
of hyphenated words. Last, new rules and adjust-
ments to rule precedence may reduce over- and under-
stemming, as well as support affixes not currently
catered for in the algorithm. We discuss the improve-
ments we propose in the next section.
5.2 Improvements
To address the limitations of the nazief scheme, we
propose the following improvements:
1. Using a more complete dictionary — we have
experimented with two other dictionaries, and
present our results later.
2. Adding rules to deal with plurals — when plu-
rals, such as “buku-buku” (books) are encoun-
tered, we propose stemming these to “buku”
(book). However, care must be taken with other
hyphenated words such as “bolak-balik” (to and
fro), “berbalas-balasan” (mutual action or inter-
action) and “seolah-olah” (as though). For these latter examples, we propose stemming the words preceding and following the hyphen separately and then, if the words have the same root word, returning the singular form. For example, in the case of “berbalas-balasan”, both “berbalas” and “balasan” stem to “balas” (response or answer), and this is returned. In contrast, the words “bolak” and “balik” do not have the same stem, and so “bolak-balik” is returned as the stem; in this case, this is the correct action, and this works for many hyphenated non-plurals (a sketch of this rule appears after the list below).
3. Adding prefixes and suffixes, and additional
rules:
(a) Adding the particle (inflection suffix)
“-pun”2. This is used in words such as
“siapapun” (where the root word is “siapa”
[who]).
(b) For the prefix type “ter”, we have modi-
fied the conditions so that row 4 in Table 2
sets the type to “ter” instead of “none”.
This supports cases such as “terpercaya”
(the most trusted), which has the root word
“percaya” (believe).
(c) For the prefix type “pe”, we have modi-
fied the conditions (similar to those listed
in Table 2) so that words such as “pekerja”
(worker) and “peserta” (member) have pre-
fix type “pe”, instead of the erroneous
“none”.
(d) For the prefix type “mem”, we have modi-
fied the conditions so that words beginning
with the prefix “memp-” are of type “mem”.
(e) For the prefix type “meng”, we have mod-
ified the conditions so that words begin-
ning with the prefix “mengk-” are of type
“meng”.
4. Adjusting rule precedence:
(a) If a word is prefixed with “ber-” and suf-
fixed with the inflection suffix “-lah”, try
to remove prefix before the suffix. This
addresses problems with words such as
“bermasalah” ([having a problem] where
the root word is “masalah” [problem]) and
“bersekolah” ([be at school] where the root
word is “sekolah”[school]).
(b) If a word is prefixed with “ber-” and suffixed
with the derivation suffix “-an”, try to re-
move prefix before the suffix. This solves
problems with, for example, “berbadan”
([having the body of] the root word is
“badan” [body]).
(c) If a word is prefixed with “men-” and suf-
fixed with the derivation suffix “-i”, try to
remove prefix before the suffix. This solves
problems with, for example, “menilai” ([to
mark] the root word is “nilai” [mark]).
2 The inflection suffix “-pun” is mentioned in the technical report
but is not included by Nazief and Adriani in their implementation.
Fault Class                                  Original         Error            Correct       Total Cases
Non-root words in dictionary                 sebagai          sebagai          bagai                  91
Hyphenated words                             buku-buku        buku-buku        buku                   43
Incomplete dictionary                        bagian           bagi             bagian                 31
Misspellings                                 penambahanan     penambahanan     tambah                 21
Incomplete affix rules                       siapapun         siapapun         siapa                  20
Overstemming                                 berbadan         bad              badan                  19
Peoples’ names                               Abdullah         Abdul            Abdullah               13
Names                                        minimi           minim            minimi                  9
Combined words                               pemberitahuan    pemberitahuan    beritahu                7
Recoding ambiguity (dictionary related)      berupa           upa              rupa                    7
Acronyms                                     pemilu           milu             pemilu                  4
Recoding ambiguity (rule related)            peperangan       perang           erang                   2
Other                                        sekali           sekali           kali                    2
Understemming                                mengecek         ecek             cek                     1
Foreign words                                mengakomodir     mengakomodir     akomodir                1
Human error                                  penebangan       terbang          tebang                  1
Total                                                                                                272

Table 7: Classified failure cases of the nazief stemmer on the majority collection. The “Original”, “Error”, and “Correct” columns give examples; the total shows the total occurrences, not the number of unique cases.
Stemmer                                  majority                       unique
                                         Correct (%)   Errors (words)   Correct (%)   Errors (words)
Original                                 92.8          272              92.1          139
(A) Alternative KBBI dictionary          88.8          426              86.9          229
(B) Alternative Online dictionary        93.8          236              92.5          131
(C) Adding repeated word rule            93.9          232              94.0          105
(D) Changes to rule precedence           93.3          255              92.7          128
(E) Adding additional affixes            93.3          253              92.8          127
(F) Combining (C) + (D) + (E)            94.8          196              95.3           82

Table 8: Improvements to the nazief stemmer, measured with the majority and unique baseline collections.
(d) If a word is prefixed with “di-” and suffixed
with the derivation suffix “-i”, try to remove
prefix before the suffix. This solves prob-
lems with, for example, “dimulai” ([to be
started] the root word is “mulai” [start]).
(e) If a word is prefixed with “pe-” and suffixed
with the derivation suffix “-i”, try to remove
prefix before the suffix. This solves prob-
lems with, for example, “petani” ( [farmer]
the root word is “tani” [farm]).
(f) If a word is prefixed with “ter-” and suffixed
with the derivation suffix “-i”, try to remove
prefix before the suffix. This solves prob-
lems with, for example, “terkendali” ([can
be controlled] the root word is “kendali”
[control]).
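The hyphen-handling rule proposed in improvement 2 above can be sketched as follows; stem_word stands in for the modified stemmer and is only a toy lookup here.

```python
# Toy stand-in for the modified stemmer, sufficient for the hyphenated examples.
TOY_STEMS = {"buku": "buku", "berbalas": "balas", "balasan": "balas",
             "bolak": "bolak", "balik": "balik"}

def stem_word(word):
    return TOY_STEMS.get(word, word)

def stem_hyphenated(word):
    if "-" not in word:
        return stem_word(word)
    left, _, right = word.partition("-")
    left_stem, right_stem = stem_word(left), stem_word(right)
    if left_stem == right_stem:
        return left_stem                 # plural or reduplicated form, e.g. "buku-buku"
    return word                          # e.g. "bolak-balik" is kept whole

for w in ("buku-buku", "berbalas-balasan", "bolak-balik"):
    print(w, "->", stem_hyphenated(w))
```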
We present results with these improvements in the
next section.
5.3 Results
Table 8 shows the results of our improvements
to the nazief stemmer. Using a different, well-
curated dictionary does not guarantee an improve-
ment: the second and third rows show the result when
the 29,337 word dictionary used in developing the
original nazief approach is replaced with the 27,828
word Kamus Besar Bahasa Indonesia (KBBI) dic-
tionary and with an online dictionary of unknown size (see http://nlp.aia.bppt.go.id/kebi/). Despite the KBBI dictionary being perhaps more comprehensive than the original, performance actually drops by 4.0% on the majority collection and 5.2% on the unique words. We believe this is due
to three factors: first, dictionaries often contain un-
stemmed words and, therefore, can cause stemming to
stop before the root word is found; second, the dictio-
nary is only part of the process and its improvement
addresses only some of the failure cases; and, last,
inclusion of new, rare words can cause matches with
incorrectly or overstemmed common words, leading
to decreases in performance for some cases while still
improving others. To test our other improvements,
we used only the original dictionary.
The fourth, fifth, and sixth rows show the effect of
including the algorithmic improvements we discussed
in the previous section. The results show the accuracy gain of including each improvement individually in the original version, while the final row shows the additive
effect of including all three. Dealing with repeated
words improves the majority result by 1.1% and
the unique result by 1.9%. Adjustments to the rule
precedence improves the results by 0.5% and 0.6%
on the two collections, and adding additional affixes
improves results by 0.5% on majority and 0.7% on unique. The combined effect of the three improve-
ments lowers the error rate to 1 in 19 words of 5 or
more characters in length, or an average of only 1
error every 38 words in the original Kompas collec-
tion. The overall outcome is highly effective Indone-
sian stemming using our modified nazief stemmer.
6 Conclusion
Stemming is an important Information Retrieval tech-
nique. In this paper, we have investigated Indonesian
stemming and, for the first time, presented an exper-
imental evaluation of Indonesian stemmers. Our re-
sults show that a successful stemmer is complex, and
requires the careful combination of several features:
support for complex morphological rules, progressive
stemming of words, dictionary checks after each step,
trial-and-error combinations of affixes, and recoding
support after prefix removal.
Our evaluation of the stemmers took the form of a user study.
Using four native speakers and a newswire collection,
we evaluated five automatic stemmers. Our results
show that the nazief stemmer is the most effec-
tive scheme, making less than 1 error in 21 words
on newswire text. With detailed analysis of failure
cases and modifications, we have improved this to less
than 1 error in 38 words. We conclude that the mod-
ified nazief stemmer is a highly effective tool.
We intend to continue this work. We will improve
the dictionaries by curating them to remove non-root words and to add missing root words. We also plan to further extend
the nazief stemmer to deal with cases where the root
word is ambiguous.
Acknowledgments
We thank Bobby Nazief for providing source code and
the dictionary used in this paper, Vinsensius Berlian
Vega for source code, Riky Irawan for his Indonesian
corpus, and Gunarso for the Kamus Besar Bahasa
Indonesia (KBBI) dictionary. We also thank Wahyu
Wibowo for his help in answering our queries and Eric
Dharmazi, Agnes Julianto, Iman Suyoto, and Hendra
Yasuwito for their help in manually stemming our col-
lection. This work is supported by the Australian
Research Council.
References
Ahmad, F., Yusoff, M. & Sembok, T. M. T. (1996),
‘Experiments with a Stemming Algorithm for
Malay Words’, Journal of the American Society
for Information Science 47(12), 909–918.
Arifin, A. Z. & Setiono, A. N. (2002), Classification of
Event News Documents in Indonesian Language
Using Single Pass Clustering Algorithm, in ‘Pro-
ceedings of the Seminar on Intelligent Technol-
ogy and its Applications (SITIA)’, Teknik Elek-
tro, Sepuluh Nopember Institute of Technology,
Surabaya, Indonesia.
Bakar, Z. A. & Rahman, N. A. (2003), Evaluating the
effectiveness of thesaurus and stemming meth-
ods in retrieving Malay translated Al-Quran doc-
uments, in T. M. T. Sembok, H. B. Zaman,
H. Chen, S. R. Urs & S. Myaeng, eds, ‘Digital Li-
braries: Technology and Management of Indige-
nous Knowledge for Global Access’, Vol. 2911
of Lecture Notes in Computer Science, Springer-
Verlag, pp. 653–662.
Frakes, W. (1992), Stemming algorithms, in
W. Frakes & R. Baeza-Yates, eds, ‘Informa-
tion Retrieval: Data Structures and Algorithms’,
Prentice-Hall, chapter 8, pp. 131–160.
Gaustad, T. & Bouma, G. (2002), ‘Accurate stem-
ming of Dutch for text classification’, Language
and Computers 45(1), 104–117.
Idris, N. (2001), Automated Essay Grading System
Using Nearest Neighbour Technique in Infor-
mation Retrieval, Master’s thesis, University of
Malaya.
Lovins, J. (1968), ‘Development of a stemming algorithm’, Mechanical Translation and Computational Linguistics 11(1–2), 22–31.
Nazief, B. A. A. & Adriani, M. (1996), Confix-
stripping: Approach to Stemming Algorithm for
Bahasa Indonesia. Internal publication, Faculty
of Computer Science, University of Indonesia,
Depok, Jakarta.
Orăsan, C., Pekar, V. & Hasler, L. (2004), A comparison of summarisation methods based on term specificity estimation, in ‘Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC2004)’, Lisbon, Portugal, pp. 1037–1041.
Porter, M. (1980), ‘An algorithm for suffix stripping’, Program 14(3), 130–137.
Savoy, J. (1993), ‘Stemming of French words based on
grammatical categories’, Journal of the Ameri-
can Society for Information Science 44(1), 1–9.
Vega, V. B. (2001), Information Retrieval for the
Indonesian Language, Master’s thesis, National
University of Singapore.
Xu, J. & Croft, W. (1998), ‘Corpus-based stem-
ming using cooccurrence of word variants’, ACM
Transactions on Information Systems 16(1), 61–
81.