Conference Paper

A Testbed for Indonesian Text Retrieval.

Abstract

Indonesia is the fourth most populous country and a close neighbour of Australia. However, despite media and intelligence interest in Indonesia, little work has been done on evaluating Information Retrieval techniques for Indonesian, and no standard testbed exists for such a purpose. An effective testbed should include a collection of documents, realistic queries, and relevance judgements. The TREC and TDT testbeds have provided such an environment for the evaluation of English, Mandarin, and Arabic text retrieval techniques. The NTCIR testbed provides a similar environment for Chinese, Korean, Japanese, and English. This paper describes an Indonesian TREC-like testbed we have constructed and made available for the evaluation of ad hoc retrieval techniques. To illustrate how the test collection is used, we briefly report the effect of stemming for Indonesian text retrieval, showing — similarly to English — that it has little effect on accuracy.
... Research methodology is a framework and set of assumptions used in elaborating research, whereas a research method requires techniques or procedures for analysing the available data. From this definition it can be concluded 5 There are around 1000 IS development methodologies 6 . Some of these methodologies resemble one another, and some are very specific to a particular organisation. ...
... The scientific corpus is a collection of research documents produced within the Badan Tenaga Atom Nasional (National Atomic Energy Agency), consisting of 1162 documents from research carried out between 1985 and 1994 [3,6]. The news corpus is a collection of articles published between January and June 2002 in the Indonesian daily newspaper Kompas online, consisting of 3000 documents [5]. ...
... creating stemming algorithms and parser), semantic and discourse analysis (i.e. based on lexical semantics and text semantic analysis), document summarization, question-answering, information extraction, cross language retrieval, and geographic information retrieval. Other significant studies were conducted by Asian, who proposed effective techniques for Indonesian text retrieval [10] and published the first Indonesian testbed [11]. It is worth mentioning that, despite the long list of works mentioned, only a limited number of the results are publicly available, and among those Indonesian studies it is hard to find work pertaining specifically to automatic ontology construction. ...
... Since it was formulated, the tolerance rough sets model (TRSM) has been accepted as a tool for modelling a document more richly than the base representation, which is a vector of TF*IDF-weighted terms 11 (let us call it the TFIDF-representation). The richness of the document representation produced by applying the TRSM (let us call it the TRSM-representation) is indicated by the number of index terms put into the model. ...
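The base TF*IDF representation mentioned in this snippet can be sketched as follows. This is a minimal illustration only, not the TRSM and not the cited authors' code: the weighting (raw term frequency times log(N/df)), the function name `tfidf_vectors`, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical toy documents, already tokenised.
docs = [
    ["harga", "beras", "naik"],
    ["harga", "minyak", "turun"],
    ["beras", "impor", "beras"],
]
vectors = tfidf_vectors(docs)
```

Each document is represented only by the terms it actually contains, which is the sparseness that richer representations such as the TRSM are described as addressing.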
... The aim of evaluation is to validate all of our proposed strategies. In the following sections, we discuss in turn the effectiveness of the tolerance value generator algorithm, the contributing factors of thesaurus optimization, and the lexicon-based document representation, by employing another Indonesian corpus, called Kompas-corpus 40 [11], in the retrieval system. ...
Article
Previous research on the Tolerance Rough Sets Model (TRSM) has followed the rational approach of the AI perspective. This article, which was a doctoral thesis, presents studies that take the contrary path, i.e. a cognitive approach, with the objective of a modular framework for a semantic text retrieval system based on TRSM, specifically for Indonesian. In addition to the proposed framework, this thesis proposes three methods based on TRSM: the automatic tolerance value generator, thesaurus optimization, and lexicon-based document representation. All methods were developed using our own corpus, namely ICL-corpus, and evaluated on an available Indonesian corpus, called Kompas-corpus. The endeavor of a semantic information retrieval system is the effort to retrieve information and not merely terms with similar meaning. This thesis is a small step toward that objective.
... Tokenization is the process of breaking an input string into the smallest units (words) that compose it [11]. In principle, the purpose of this process is to identify the smallest units that make up a document [12]. ...
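A minimal sketch of the tokenization step described in this snippet is shown below. It is an illustration under assumptions rather than the cited authors' procedure: the function name `tokenize` is hypothetical, text is lowercased, and only runs of ASCII letters and digits are kept, so hyphenated reduplications such as "anak-anak" are split into two tokens.

```python
import re

def tokenize(text):
    """Split a string into lowercase word tokens.

    Assumption: a token is a maximal run of ASCII letters or digits;
    punctuation and whitespace act as separators.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Harga beras naik 5%.")
```

A real Indonesian tokenizer would likely need a rule for reduplicated forms and for abbreviations; those choices are left open here.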
Article
The development of technology and the spread of information on the internet continue to increase. One form of information whose volume keeps growing is news. Print and electronic media are now packaged in digital form, commonly known as online news portals or online media. PT Merah Putih Media is an online news outlet. Its news falls into three categories: news about Indonesia, Entertainment and Lifestyle, and Sports. However, news articles are assigned to categories manually by the editor-in-chief. Text mining is one technique that can be used to classify documents. In this study, automatic category classification was carried out using the Multinomial Naïve Bayes and Complement Naïve Bayes algorithms and a combination of the two models. The best-performing model, judged by the F1-score metric with an 80:20 split between training and test data, achieved an F1-score of 90.13%.
... A thesaurus book in Bahasa has been published, but the number of items is small [9]. Many studies claim to use data repositories built during the research, but most repositories are not openly available [10,11]. Developing language repositories can be costly in terms of data collection, annotation, and validation [12,13]. ...
Article
A language repository is valuable as a reference for using the language, for its preservation, and for developing and implementing natural language processing algorithms. Bahasa Indonesia is a natural language that has hardly any repositories, despite its large number of speakers and previous attempts to build them. We devised a way to develop a repository of phrase definitions in Bahasa using a form of crowdsourcing and investigated its implementation. An application add-on was inserted into an information system that manages the final-year projects of undergraduate students. The add-on invites students to participate in writing and validating keyword definitions. An investigation over a period of six months reveals that about 25% of application users took part in the voluntary activities as definition writers, validators, or both. During the period, about 1200 phrase definitions were added to the repository, and on average each definition was validated by two participants. The activity is supported by users who are well aware of the tasks and have a positive perception of the work, despite the different reasons that motivate their contributions.
... Determining similarity semantically is more accurate than computing similarity from the number of exactly matching words [6]. However, semantic similarity algorithms have not been widely applied to Indonesian text because of various obstacles, among them the absence of an Indonesian word net and of a standard test bed for evaluating algorithms [7]. ...
Article
Text similarity algorithms have been applied in a variety of applications such as plagiarism detection, document clustering, news text classification, automatic question answering, and machine translation. Some applications have shown good results. Unfortunately, attempts to apply semantic similarity algorithms to Indonesian text have not been very successful because of the scarcity of Indonesian knowledge bases, for example thesauri or word nets. This study focuses on collecting hyponyms and meronyms for Indonesian, building a corpus of sentence pairs reviewed by native speakers to rate their degree of similarity, and examining the effectiveness of semantic similarity algorithms in measuring the similarity of the Indonesian sentences in the corpus. Word similarity is derived from word relations in the form of synonyms, hyponyms, and meronyms, which serve as the knowledge base. The study shows that using this knowledge base raises the similarity score of sentences containing lexically related words. The study computes the correlation between the similarity scores produced by the algorithms and the sentence similarity scores as perceived by speakers. Three calculation algorithms were tested. Similarity computed as the percentage of identical word occurrences gave a correlation of 0.7128. Similarity computed with the cosine function gave a correlation of 0.7408. Similarity computed with a semantic algorithm that takes word relatedness into account gave the highest correlation, 0.7508.
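The two non-semantic measures named in this abstract can be sketched as follows. The abstract does not give the exact formulas, so this is an assumption-laden illustration: the word-overlap measure is taken to be the share of distinct words common to both sentences (a Jaccard-style ratio), the cosine is computed over raw word counts, and the function names and the two example sentences are hypothetical.

```python
import math
from collections import Counter

def overlap_similarity(a, b):
    """Shared distinct words divided by all distinct words (assumed formula)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cosine_similarity(a, b):
    """Cosine of the angle between raw word-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical sentence pair differing in one word.
s1 = "anak itu makan nasi".split()
s2 = "anak itu makan roti".split()
overlap = overlap_similarity(s1, s2)  # 3 shared of 5 distinct words
cosine = cosine_similarity(s1, s2)    # 3 / (2 * 2)
```

Neither measure sees that "nasi" and "roti" are both foods; a semantic measure backed by a knowledge base of synonyms, hyponyms, and meronyms is what the study adds on top of these.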
... While little NLP has been performed for Malay, Indonesian has seen some interesting work in the past few years. Adriani et al. (2007) examined stemming Indonesian, somewhat overlapping with Baldwin and Awab above, evaluating on the information retrieval testbed from Asian et al. (2004). Recently, a probabilistic parser of Indonesian has been developed, as discussed in Gusmita and Manurung (2008), and used for information extraction and question answering (Larasati and Manurung, 2007). ...
Article
We develop a data set of Malay lexemes labelled with count classifiers, that are attested in raw or lemmatised corpora. A maximum entropy classifier based on simple, language-inspecific features generated from context tokens achieves about 50% F-score, or about 65% precision when a suite of binary classifiers is built to aid multi-class prediction of headword nouns. Surprisingly, numeric features are not observed to aid classification. This system represents a useful step for semi-supervised lexicography across a range of languages.
Thesis
In research on Text Processing, Data Mining, Knowledge Data Discovery, and text in general, reducing derived words to their stems is very important, because processing unstemmed words can cause mistakes and deviations in the results. In this research, stemming is performed by a stemmer based on a ruleset stored as an XML (Extensible Markup Language) file, in the hope that the ruleset can be easily customized. The stemmer itself is a parser built on the FSA (Finite State Automata) principle, because Indonesian words may contain double prefixes or suffixes. Three evaluations were carried out in this research. The first yields 84.33% accuracy on 25269 test words. The second improves on the first, especially in the ruleset structure, and yields 87.22% accuracy. The third focuses on adding information to the ruleset and yields 90.25% accuracy on 25269 test words.
Article
This paper presents our recent work in regard to building Large Vocabulary Continuous Speech Recognition (LVCSR) systems for the Thai, Indonesian, and Chinese languages. For Thai, since there is no word boundary in the written form, we have proposed a new method for automatically creating word-like units from a text corpus, and applied topic and speaking style adaptation to the language model to recognize spoken-style utterances. For Indonesian, we have applied proper noun-specific adaptation to acoustic modeling, and rule-based English-to-Indonesian phoneme mapping to solve the problem of large variation in proper noun and English word pronunciation in a spoken-query information retrieval system. In spoken Chinese, long organization names are frequently abbreviated, and abbreviated utterances cannot be recognized if the abbreviations are not included in the dictionary. We have proposed a new method for automatically generating Chinese abbreviations, and by expanding the vocabulary using the generated abbreviations, we have significantly improved the performance of spoken query-based search.
Conference Paper
Stemming words to (usually) remove suffixes has applications in text search, machine translation, document summarization, and text classification. For example, English stemming reduces the words "computer," "computing," "computation," and "computability" to their common morphological root, "comput-." In text search, this permits a search for "computers" to find documents containing all words with the stem "comput-." In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. This work surveys existing techniques for stemming Indonesian words to their morphological roots, presents our novel and highly accurate CS algorithm, and explores the effectiveness of stemming in the context of general-purpose text information retrieval through ad hoc queries.
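The effect of stemming on search described in this abstract can be illustrated with a toy example. This is a sketch under stated assumptions, not the CS algorithm or any published stemmer: `toy_stem` strips one suffix from a small hypothetical list chosen so that the English words in the abstract reduce to "comput", and the documents and helper names are invented for the illustration.

```python
from collections import defaultdict

# Hypothetical suffix list, ordered longest first; not a real stemmer's rules.
SUFFIXES = ("ability", "ation", "ing", "ers", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
    return index

# Hypothetical documents.
docs = {
    1: "computing power",
    2: "theory of computation",
    3: "computability results",
    4: "rice prices",
}
index = build_index(docs)
# The query term is stemmed the same way as the indexed text.
hits = sorted(index[toy_stem("computers")])
```

A query for "computers" reaches documents 1 to 3 because the query and the indexed words share the stem "comput". Indonesian stemming is harder than this suffix-only sketch suggests, since prefixes, infixes, and confixes must also be handled.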
Article
This paper serves as an introduction to the research described in detail in the remainder of the volume. The next section provides a summary of the retrieval background knowledge that is assumed in the other papers. Section 3 presents a short description of each track; a more complete description of a track can be found in that track's overview paper in the proceedings. The final section looks forward to future TREC conferences.
Article
The first Text REtrieval Conference (TREC-1) was held in early November 1992 and was attended by about 100 people working in the 25 participating groups. The goal of the conference was to bring research groups together to discuss their work on a new large test collection. A large variety of retrieval techniques was reported, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching. As results had been run through a common evaluation package, groups were able to compare the effectiveness of different techniques and discuss how differences among the systems affected performance.
Article
The Text REtrieval Conference is a workshop series designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures and a forum for organizations interested in comparing results. TREC contains two main retrieval tasks plus optional subtasks that allow participants to focus on particular common subproblems in retrieval. The emphasis on individual experiments evaluated in a common setting has proven to be very successful. In the six years since the beginning of TREC, the state of the art in retrieval effectiveness has approximately doubled, and technology transfer among research labs and between research systems and commercial products has accelerated. In addition, TREC has sponsored the first large-scale evaluations of Chinese language retrieval, retrieval of speech and retrieval across different languages.
Article
Contents
1 Introduction
2 A Purely Rule-based Stemmer for Bahasa Indonesia
2.1 Morphological Structure of Bahasa Indonesia Words
2.2 The Porter Stemming Algorithm
2.3 Porter Stemmer for Bahasa Indonesia
2.3.1 Implementation
3 Evaluation of the Stemming Algorithm
3.1 Stemmer Quality Evaluation
3.1.1 The Paice Evaluation Method
3.1.2 The Paice Experimental Results
3.2 Error Analysis
3.2.1 Inflectional Structure
3.2.2 Derivational Structure
Article
The majority of information retrieval experiments are evaluated by measures such as average precision and average recall. Fundamental decisions about the superiority of one retrieval technique over another are made solely on the basis of these measures. We claim that average performance figures need to be validated with a careful statistical analysis and that there is a great deal of additional information that can be uncovered by looking closely at the results of individual queries. This paper is a case study of stemming algorithms which describes a number of novel approaches to evaluation and demonstrates their value.
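The point about looking at individual queries rather than only an average can be sketched with a small evaluation routine. This is an illustration under assumptions, not the evaluation package used in the cited work: it uses the standard non-interpolated average precision, and the run data, relevance judgements, and names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision for one query.

    ranked: document ids in the order the system returned them.
    relevant: set of document ids judged relevant for the query.
    """
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranked results and relevance judgements for two queries.
runs = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d5", "d6", "d7", "d8"],
}
qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d8"},
}
per_query = {q: average_precision(runs[q], qrels[q]) for q in runs}
mean_ap = sum(per_query.values()) / len(per_query)
```

Here the mean hides a large gap between the two queries, which is the kind of information the abstract argues is lost when only averaged figures are compared.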
E. M. Voorhees and D. K. Harman. Eighth TREC conference (TREC-8). In E. M. Voorhees and D. K. Harman (editors), Proceedings of the 8th Text REtrieval Conference (TREC-8), pages 1–24. NIST Special Publication 500-246, 1999.
Harman (editors), Proceedings of the 6th Text REtrieval Conference (TREC-6), pages 1–24. NIST Special Publication 500-240, 1997.
Harman (editors), Proceedings of the 9th Text REtrieval Conference (TREC-9), pages 1–14. NIST Special Publication 500-249, 2000.