TLS-ART: Thai Language Segmentation by Automatic
Ranking Trie
Chalermpol Tapsai1, Phayung Meesad2 and Choochart Haruechaiyasak3
1,2Faculty of Information Technology,
King Mongkut's University of Technology North Bangkok, Thailand
3National Electronics and Computer Technology Center, Thailand
1chalermpol.t@email.kmutnb.ac.th, 2phayung.m@it.kmutnb.ac.th,
3Choochart.Haruechaiyasak@nectec.or.th
Abstract: Thai is a non-segmented natural language (NL) in which all
words appear in sentences continuously, without any delimiters, making
Word Segmentation (WS) difficult. Thai WS programs have been developed
and continuously improved by many researchers. The most widely used is
the Thai Lexeme Tokenizer (LexTo), which uses a trie structure and the
longest-matching technique. LexTo works well but has two main
disadvantages: (1) the dictionary is too big, and (2) too many
dispensable words are matched before the correct word is found. In this
research, Thai Language Segmentation by Automatic Ranking Trie
(TLS-ART) is proposed. TLS-ART uses Word Usage Frequency (WUF) to
exclude unused words from the dictionary and to reorganize words in the
trie structure, reducing the matching task and significantly improving
efficiency. The experimental results show that accuracy, precision,
recall, and F-measure are comparable to LexTo, while the dictionary is
86.07% smaller and the matching task decreases by 12.73%.
Keywords: Natural Language; TLS-ART; Thai Language Segmentation; Trie;
LexTo
1 Introduction
In the modern era, computers are devices that play an important role in
human daily life. They are used so widely that the majority of human
work today inevitably involves computer processing. To command a
computer, one needs to understand a "computer language", a special
language used for creating programs: sets of instructions that tell a
computer to read data, process it, and display the results according to
user needs. Although many computer languages have been developed to
bring their syntax closer to human language, they are still hard for
non-technical users to understand. In addition, inexperienced
programmers may take a long time to learn how to develop an efficient
computer program. These problems suggest that, instead of making humans
understand computer languages, a better way is to make computers
understand human natural language, the language used in our everyday
life. This concept helps users command a computer in their own language
and express their requirements correctly, without extra training.
However, although many Natural Language Processing (NLP) techniques
have been developed in numerous directions, they are not significantly
progressive or widely used. This is due to various problems, e.g., the
diversity of natural languages, which differ across the races,
countries, and regions where people live. Moreover, natural languages
are complex: some words have multiple types and meanings, and one
sentence can be interpreted with more than one meaning. Conversely, one
answer can be the result of different sentences, owing to the rhetoric
or familiarity of each user. These issues are the main cause of
inaccurate word segmentation in NLP and remain a major obstacle to
successful research in this field.
In the general case, NLP comprises four major steps: 1) Lexical
Analysis, 2) Syntactic Analysis, 3) Semantic Analysis, and 4) Output
Transformation. Lexical Analysis is an important process that analyses
natural language sentences, splitting them into small units called
tokens, together with their types and the essential information used by
the next step. Faulty analysis and wrongly segmented words lead, in
turn, to wrong interpretations, incorrect meanings, and erroneous
output. Especially in non-segmented languages such as Thai, Lao,
Burmese, Chinese, Japanese, Korean, etc., all words in a sentence are
written in one continuous run, without spaces or special characters to
distinguish between words. Such sentences are complex, and it is very
easy to segment words wrongly.
In the case of Thai, many researchers have developed word segmentation
algorithms using various techniques. For example, Chaloenpomsawat [3]
used a feature-based approach with the RIPPER and Winnow learning
algorithms. Henasanangul et al. [7] used string matching and word
identification in a dictionary to identify the unknown-word boundaries
of partially hidden words and explicit unknown words. Tepdang et al.
[14] improved Thai word segmentation with Named Entity Recognition,
using the Conditional Random Fields (CRF) algorithm for training and
recognizing Thai named entities. Suwannawach [13] used Maximum Matching
and the Tri-gram technique. Haruechaiyasak et al. [6] conducted
experiments comparing the performance of the dictionary-based (DCB)
method with a trie algorithm against the machine-learning-based (MLB)
method with four techniques: Naive Bayes (NB), Decision Tree (DT),
Support Vector Machine (SVM), and Conditional Random Field (CRF). The
results showed that DCB with the trie algorithm and MLB with CRF gave
the best precision and recall.
Since 2003, many TLS programs have been developed and distributed for
public use; one of the most illustrious is LexTo. Using the DCB method
with a trie algorithm and the longest-word-matching technique, LexTo
can analyze sentences and split Thai words with high accuracy, but
there are two main problems. Firstly, the dictionary is too big: more
than 40,000 words are included, with a large number of unused words,
while some necessary and frequently used words are not stored. This
results in many unknown words and forces users to add more words to the
dictionary. Secondly, the organization of words in the trie is not
efficient at reducing the matching task, causing excessive matching
against dispensable words and taking a long time to find a word. To
address these problems, Thai Language Segmentation by Automatic Ranking
Trie (TLS-ART) is proposed herein. It improves the trie by using the
actual Word Usage Frequency (WUF) of Thai words to exclude unused words
from the dictionary, and it reorganizes the words in the trie so that
more frequently used words are sought before less frequently used ones,
reducing the matching task.
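To illustrate the ranking idea, the following is a minimal Python
sketch, under our own assumptions about the node layout and method
names (it is not LexTo's or TLS-ART's actual code). Each node keeps its
child links sorted by the best WUF reachable through them, so lookups
walk frequent branches first:

```python
class RankedTrieNode:
    """Trie node whose outgoing links are kept ordered by the best
    Word Usage Frequency (WUF) reachable through each link."""

    def __init__(self):
        self.children = []   # list of (char, node), most frequent branch first
        self.is_word = False
        self.freq = 0        # WUF of the word ending here, if any
        self.max_freq = 0    # best WUF anywhere in this subtree (ranking key)


class RankedTrie:
    def __init__(self):
        self.root = RankedTrieNode()

    def insert(self, word, freq):
        """Insert a word with its WUF and re-rank sibling links so that
        branches leading to frequent words are probed first."""
        node = self.root
        for ch in word:
            child = next((n for c, n in node.children if c == ch), None)
            if child is None:
                child = RankedTrieNode()
                node.children.append((ch, child))
            child.max_freq = max(child.max_freq, freq)
            node.children.sort(key=lambda link: -link[1].max_freq)
            node = child
        node.is_word = True
        node.freq = freq


# Toy usage with made-up frequencies:
trie = RankedTrie()
for word, wuf in [("กิน", 120), ("ข้าว", 95), ("กัญชา", 3)]:
    trie.insert(word, wuf)
# The root now lists the 'ก' branch (best WUF 120) before 'ข' (95).
```

With children ranked this way, matching a frequent word touches fewer
sibling links than in an unranked trie, which is the intuition behind
the link-search reduction reported in Section 5.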
The remainder of this article is organized as follows. Section 2
presents an overview of research and problems related to NLP. Section 3
covers natural language processing concepts. Section 4 presents TLS-ART
in detail. Section 5 shows the research results. Finally, Section 6
gives concluding remarks and future research directions.
2 Related Work and Existing Problems
2.1 Research in NLP
Recently, many researchers have studied NLP in many different ways. For
example, the researchers in [4] studied NLP to extract the requirements
specification from software handbooks written in free-form natural
language and translate it into a formally defined language. Automatic
document indexing based on the number of occurrences of each substring
in a document, using a tree structure, was studied in [15], while [11]
proposed an NLP-based technique to classify legal contract agreement
documents. Moreover, NLP interfaces to computer systems for retrieving
information from databases have been studied using various techniques
and algorithms, such as LUNAR [16], FREyA [8], and NLKBIDB [10]. These
techniques cover natural language in both simple sentences and negation
sentences, including words such as "outside", "exclude", "does not",
"not", "no", and so on.
2.2 Existing Problems
Androutsopoulos et al. [1] mentioned that key factors in the
development of NLP are expertise in linguistics and the specialisation
of research work; a lack of such expertise hinders the progress of NLP
research and development. This is consistent with Pazos et al. [9], who
mentioned four major problems that often occur when using natural
language to interface with databases: 1) the various grammatical forms
of natural language; 2) the omission of some words important to the
meaning of a sentence; 3) querying for information that relates to many
tables and uses aggregate functions; and 4) problems caused by human
errors.
3 Methods and Techniques of NLP
For more than 40 years, NLP research has been conducted to facilitate
computer utilization, using numerous methods and techniques. The main
NLP process can be divided into four steps [10], as shown in Figure 1.
Fig. 1: Natural Language Processing steps
1. Lexical Analysis: This step analyses natural language sentences by
splitting them into small items, each called a token. The tokens are
assigned types, and essential information is produced for use in the
next step.
2. Syntactic Analysis: In this step, all tokens are parsed against a
predefined sentence structure (syntax) for validity checking, providing
information to be used in the meaning analysis process.
3. Semantic Analysis: This process interprets the meaning of a sentence
by parsing the information derived from the previous step against a
semantic structure, such as an ontology or a semantic web structure, to
produce data that represent the meaning of the sentence.
4. Output Transformation: This step transforms the outputs derived from
Semantic Analysis into results that meet the objectives of the target
work, such as SQL commands for information retrieval from databases. (A
toy end-to-end sketch of these four steps follows this list.)
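To make the four stages concrete, the deliberately toy Python sketch
below chains one stand-in function per step; the function names, the
trivial rules, and the final SQL template are all placeholders of our
own, not an implementation from any of the cited systems.

```python
def lexical_analysis(sentence):
    """Step 1: split the sentence into (token, type) pairs."""
    return [(tok, "WORD") for tok in sentence.split()]

def syntactic_analysis(tokens):
    """Step 2: check the token sequence against a (here trivial) syntax."""
    if not tokens:
        raise ValueError("empty sentence")
    return {"tokens": tokens}

def semantic_analysis(parse):
    """Step 3: map the parse onto a semantic structure (a toy intent frame)."""
    words = [tok for tok, _ in parse["tokens"]]
    return {"intent": "retrieve", "entity": words[-1]}

def output_transformation(meaning):
    """Step 4: transform the meaning into the target output, e.g. SQL."""
    return f"SELECT * FROM {meaning['entity']};"

print(output_transformation(semantic_analysis(
    syntactic_analysis(lexical_analysis("list all customers")))))
# -> SELECT * FROM customers;
```

In a real system, each stand-in would be a full component; for Thai,
the lexical step would be a word segmenter such as LexTo or TLS-ART.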
As mentioned before, because Thai is a non-segmented natural language,
word segmentation in lexical analysis is a very important process,
owing to the difficulty of splitting words out of sentences. If this
analysis is not effective enough, it will produce wrong results, and
the subsequent processes, i.e., syntactic analysis and semantic
analysis, will inevitably produce wrong output too. At present, some
lexical analysis systems are available for public use. For example,
WordNet is a system that can analyze English words with a large online
database. For Thai, LexTo, a program developed by the National
Electronics and Computer Technology Center (NECTEC), is widely used for
Thai word segmentation.
4 The Proposed TLS-ART
The main idea of this research is to improve the efficiency of the
dictionary-based method by excluding excessive words from the
dictionary and reorganizing the words in the trie. The proposed
technique, Thai Language Segmentation by Automatic Ranking Trie
(TLS-ART), employs actual usage frequency to reduce the dictionary size
and the number of matching tasks needed to split and identify words in
sentences. Figure 2 shows the flowchart of the proposed TLS-ART. There
are four steps in the proposed technique, as shown in Figure 3.
Fig. 2: TLS-ART flowchart

Fig. 3: TLS-ART research steps

1. Dataset Preparation: In this step, datasets are prepared for
building the dictionary and trie. The dataset used in this research is
a set of Thai-language sample text files collected from the actual
daily usage of Thai people. The sample text files were created from
sentences randomly collected from popular websites and conversation
chat dialogs covering all major fields, including economics, social,
political, entertainment, and others. In total, 48 websites and 1,320
files were collected, as shown in Table 1. The dataset was divided into
two sets: a training set, used in the Word Usage Analysis process, and
a test set, used in the TLS-ART evaluation process.

Table 1: Number of sample websites and text files in each category of
the dataset

Category       No. Websites  No. Training Files  No. Test Files
Economics      8             200                 20
Social         8             200                 20
Political      8             200                 20
Entertainment  8             200                 20
Chat room      8             200                 20
Others         8             200                 20
Total          48            1,200               120

2. Word Usage Analysis: This process analyses the texts in the training
set and counts the number of appearances of each word, yielding its
Word Usage Frequency (WUF).
3. Create Dictionary and Trie: This process saves each word with its
WUF to the dictionary and creates the trie, placing words ordered by
usage frequency from high to low.
4. TLS-ART Processing and Evaluation: The main task of TLS-ART
processing is to parse the input text files from the test set against
the trie, comparing character by character to find the longest-matching
word and counting word appearances, which are then used to improve the
trie and dictionary. When no matching word is found, the unknown string
is shown to users for verification and added as a new word to the
dictionary and trie. To prove its effectiveness, TLS-ART is compared
with LexTo on the test set of 120 text files randomly collected from
popular websites in the six categories shown in Table 1. (A minimal
sketch of the counting and matching loop follows this list.)
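The self-contained Python sketch below illustrates steps 2-4: counting
WUF, building a frequency-ordered trie, and greedy longest-match
segmentation with unknown strings flagged for verification. The data
structures and the toy Thai words are our own assumptions, not the
paper's code.

```python
from collections import Counter

def build_trie(wuf):
    """Nested-dict trie; the '$' key marks a word end and stores its WUF.
    In the paper's trie, sibling links are ordered by WUF so frequent
    words need fewer link comparisons; a Python dict hides that ordering
    behind hashing, so this sketch mirrors only the structure."""
    root = {}
    for word, freq in sorted(wuf.items(), key=lambda kv: -kv[1]):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = freq
    return root

def segment(text, trie):
    """Greedy longest-match scan; returns (token, is_known) pairs.
    Unknown characters are flagged so a user can verify and add them."""
    tokens, i = [], 0
    while i < len(text):
        node, j, last_end = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                last_end = j                  # longest match seen so far
        if last_end is None:
            tokens.append((text[i], False))   # unknown: report for verification
            i += 1
        else:
            tokens.append((text[i:last_end], True))
            i = last_end
    return tokens

# Step 2 (Word Usage Analysis) would produce the WUF map, e.g. by
# counting tokens over the segmented training files; toy values here:
wuf = Counter({"กิน": 120, "ข้าว": 95, "กัน": 40})
trie = build_trie(wuf)
print(segment("กินข้าวกัน", trie))
# -> [('กิน', True), ('ข้าว', True), ('กัน', True)]
```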
Table 2: Dictionary size and number of link searches used by TLS-ART
compared with LexTo

Dictionary   No. Words   No. Link Searches
LexTo        42,222      1,060,514
TLS-ART      5,881       925,467
Decrease     86.07%      12.73%

Table 3: Performance evaluation

Technique   Accuracy   Precision   Recall   F-measure
LexTo       0.935      0.957       0.976    0.967
TLS-ART     0.936      0.958       0.976    0.967
5 Experimental Results
Experiments comparing the performance of word segmentation using
TLS-ART and LexTo show that TLS-ART reduces the dictionary size from
42,222 words to 5,881 words, or by 86.07%. In addition, the number of
link searches used by TLS-ART is 925,467, which is 12.73% less than
LexTo, while the accuracy, precision, recall, and F-measure values are
nearly equal, as shown in Tables 2 and 3.
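As a check, both reported reductions follow directly from the raw
counts in Table 2:

```latex
\frac{42{,}222 - 5{,}881}{42{,}222} = \frac{36{,}341}{42{,}222} \approx 86.07\%,
\qquad
\frac{1{,}060{,}514 - 925{,}467}{1{,}060{,}514} = \frac{135{,}047}{1{,}060{,}514} \approx 12.73\%.
```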
6 Conclusion, Discussion and Future Work
To reduce the matching task and improve the efficiency of Thai
segmentation, this research proposes Thai Language Segmentation by
Automatic Ranking Trie (TLS-ART). Word Usage Frequency (WUF) is used to
exclude unused words from the dictionary and to reorganize the words in
the trie structure. Experimental results show that TLS-ART
significantly reduces the dictionary size, and the reorganization of
words in the trie by WUF markedly reduces the number of link searches.
From this study, it is observed that most unknown words are specific
names of people, places, and things, along with other rarely used
vocabulary. Moreover, some specific names may include other words as
their parts. Segmenting and identifying these unknown words correctly
is a challenging issue for future research.
References
[1] Androutsopoulos, I., Ritchie, G.D., Thanisch, P.: Natural Language
Interfaces to Databases - An Introduction, Natural Language
Engineering, 1, 1, pp. 29–81, 1995
[2] Al-Suwaiyel, M., Horowitz, E.: Algorithms for trie compaction, ACM
Transactions on Database Systems, 9, 2, pp. 243–263, 1984
[3] Chaloenpomsawat, P.: Feature-Based Thai Word Segmentation, Master
Thesis, Chulalongkorn University, 1998
[4] Fatwanto, A.: Software Requirements Specification Analysis Using Natu-
ral Language Processing Technique, IEEE Quality in Research, 2013
[5] Fellbaum, C.: WordNet and wordnets, In: Brown, K. et al. (eds.), Encyclo-
pedia of Language and Linguistics, 2nd ed., Oxford: Elsevier, pp. 665–670,
2005 [http://wordnetweb.princeton.edu/perl/webwn]
[6] Haruechaiyasak, C., Kongyoung, S., Dailey, M.: A Comparative Study
on Thai Word Segmentation Approaches, In: Proceedings of the 5th
International Conference on ECTI-CON 2008, pp. 125–128, 2008
[7] Henasanangul, T., Seresangtakul, P.: Thai Text with Unknown Word Seg-
mentation Using the Word Identification, Khonkaen University Research
Journal, 6, 2, pp. 48–57, 2006
[8] Llopis, M., Ferrández, A.: How to make a natural language interface
to query databases accessible to everyone: An example, Computer
Standards & Interfaces, 2012
[9] Pazos, R.A., González, J.J., Aguirre, M.A.: Semantic Model for
Improving the Performance of Natural Language Interfaces to Databases,
In: Advances in Artificial Intelligence: 10th Mexican International
Conference on Artificial Intelligence, MICAI 2011, Vol. 7094, pp.
277–290, 2011
[10] Shah, A., Pareek, J., Patel, H., Panchal, N.: NLKBIDB - Natural language
and keyword based interface to database, In: IEEE International Conference
on Advances in Computing, Communications and Informatics (ICACCI 2013),
pp. 1569–1576, 2013
[11] Slankas, J., Williams, L.: Classifying Natural Language Sentences
for Policy, In: IEEE International Symposium on Policies for
Distributed Systems and Networks (POLICY), pp. 34–36, 2012
[12] Song, P., Shu, A., Phipps, D.: Language Without Words: A Pointillist
Model for Natural Language Processing, SCIS-ISIS 2012, Kobe, Japan,
2012
[13] Suwannawach, P.: Thai Word Segmentation Improvement using Maximum
Matching and Tri-gram Technique, Master Thesis, King Mongkut's
Institute of Technology Ladkrabang, 2012
[14] Tepdang, S., Haruechaiyasak, C., Kongkachandra, R.: Improving Thai
word segmentation with Named Entity Recognition. In: International
Symposium on Communications and Information Technologies (ISCIT), 2010
[15] Todsanai, C.: An automatic indexing technique for Thai texts using fre-
quent max substring, In: IEEE Eighth International Symposium on Natural
Language Processing, 2009
[16] Woods, W.A., Kaplan, R.M., Webber, B.L.: The Lunar Sciences Natural
Language Information System, Final Report, BBN Report 2378, Cambridge,
Massachusetts: Bolt Beranek and Newman Inc., 1972