TLS-ART: Thai Language Segmentation by Automatic
Ranking Trie
Chalermpol Tapsai1, Phayung Meesad2, and Choochart Haruechaiyasak3
1,2Faculty of Information Technology
King Mongkut’s University of Technology North Bangkok, Thailand
3National Electronics and Computer Technology Center, Thailand
1chalermpol.t@email.kmutnb.ac.th, 2phayung.m@it.kmutnb.ac.th,
3Choochart.Haruechaiyasak@nectec.or.th
Abstract: Thai is a non-segmented Natural Language (NL) in which all words are written continuously in sentences without any delimiters, which makes Word Segmentation (WS) difficult. Thai WS programs have been developed and continuously improved by many researchers. The most widely used is the Thai Lexeme Tokenizer (LexTo), which uses a trie structure and the longest-match technique. LexTo works well but has two main disadvantages: (1) the dictionary is too big, and (2) too many dispensable words are matched before the correct word is found. In this research, Thai Language Segmentation using Automatic Ranking Trie (TLS-ART) is proposed. TLS-ART uses Word Usage Frequency (WUF) to exclude unused words from the dictionary and to reorganize the words in the trie structure, which reduces the matching task and significantly improves efficiency. The experimental results show that the accuracy, precision, recall, and F-measure values are comparable to LexTo, while the dictionary is 86.07% smaller and the matching task decreases by 12.73%.
Keywords: Natural Language; TLS-ART; Thai Language Segmenta-
tion; Trie; LexTo
1 Introduction
In the modern era, computers play an important role in human daily life and are used so widely that the majority of human work today is inevitably related to computer processing. To command a computer, one needs to understand a "Computer Language", a special language used for creating programs: sets of instructions that tell computers to read data, process it, and display the results
according to user needs. Although many computer languages have been developed to bring their syntax closer to human language, they are still hard for non-technical users to understand. In addition, inexperienced programmers may take a long time to learn how to develop an efficient computer program. These problems suggest that, instead of making humans understand computer languages, a better way is to make computers understand human Natural Language, the language used in our everyday life. This concept helps users command a computer in their own language and express their requirements correctly without extra training.
However, although many Natural Language Processing (NLP) techniques have been developed in numerous ways, they have not progressed significantly or become widely used. This is due to various problems, e.g., the diversity of natural languages, which differ across races, countries, and regions. Moreover, natural languages are complex: some words may have multiple types and meanings, one sentence can be interpreted in more than one way, and one answer can result from different sentences because of the rhetoric or familiarity of each user. These are the main causes of inaccurate word segmentation in NLP and remain a major obstacle to successful research in this field.
In the general case, NLP includes four major steps: 1) Lexical Analysis, 2) Syntactic Analysis, 3) Semantic Analysis, and 4) Output Transformation. Lexical Analysis is an important process that splits natural language sentences into small units called tokens, together with their types and the essential information used by the next step. Faulty analysis and wrongly segmented words lead to wrong interpretation, incorrect meaning, and, consequently, erroneous output. This is especially true in non-segmented languages such as Thai, Lao, Burmese, Chinese, Japanese, Korean, etc., where all words in a sentence are written continuously without spaces or special characters to distinguish between words. These languages are complex, and segmenting the wrong words is highly likely.
In the case of Thai, many researchers have developed word segmentation algorithms using various techniques. For example, Chaloenpomsawat [3] used a feature-based approach with the RIPPER and Winnow learning algorithms. Henasanangul et al. [7] used string matching and dictionary-based word identification to identify the unknown-word boundaries of partially hidden words and explicit unknown words. Tepdang et al. [14] improved Thai word segmentation with Named Entity Recognition, using the Conditional Random Fields (CRFs) algorithm for training and recognizing Thai named entities. Suwannawach [13] used the Maximum Matching and Tri-gram technique. Haruechaiyasak et al. [6] conducted experiments comparing the performance of the dictionary-based (DCB) method with the Trie algorithm against the machine-learning-based (MLB) method with four techniques: Naive Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), and Conditional Random Fields (CRF). The results showed that DCB with the Trie algorithm and MLB with the CRF technique gave the best precision and recall.
Since 2003, many TLS programs have been developed and distributed for public use, and one of the most illustrious is LexTo. Using the DCB method with a Trie algorithm and the longest-word matching technique, LexTo can analyze sentences and split Thai words with high accuracy, but it has two main problems. Firstly, the dictionary is too big: more than 40,000 words are included, many of them unused, while some necessary and frequently used words are not stored. This results in many unknown words and forces users to add more words to the dictionary. Secondly, the words in the Trie are not organized efficiently enough to reduce the matching task, causing excessive matching against dispensable words and a long search time per word. To address these problems, Thai Language Segmentation by Automatic Ranking Trie (TLS-ART) is proposed herein. TLS-ART improves the Trie by using the actual Word Usage Frequency (WUF) of Thai words to exclude unused words from the dictionary and to reorder the words in the Trie so that more frequently used words are examined before less frequently used ones, thereby reducing the matching task.
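To make the trie-based longest-match idea and the effect of frequency ranking concrete, the following minimal Python sketch stores the children of each trie node in a linearly searched list, so inserting frequently used words first reduces the number of link comparisons. It is only an illustration of the general technique under our own naming (RankedTrie, segment, and the toy dictionary are hypothetical), not the actual LexTo or TLS-ART implementation.

```python
# Minimal sketch of a frequency-ranked trie with greedy longest-match segmentation.
# Children are kept as a linearly searched list, so placing frequently used words
# first reduces the number of link comparisons ("link searches").
# Illustrative only; not the actual LexTo or TLS-ART implementation.

class TrieNode:
    def __init__(self, ch=""):
        self.ch = ch
        self.children = []          # searched linearly, in insertion order
        self.is_word = False

class RankedTrie:
    def __init__(self, words_with_freq):
        self.root = TrieNode()
        self.link_searches = 0
        # Insert words from most to least frequent so that frequent branches
        # sit near the front of each child list.
        for word, _ in sorted(words_with_freq.items(), key=lambda kv: -kv[1]):
            node = self.root
            for ch in word:
                node = self._child(node, ch, create=True)
            node.is_word = True

    def _child(self, node, ch, create=False):
        for child in node.children:            # linear link search
            self.link_searches += 1
            if child.ch == ch:
                return child
        if create:
            child = TrieNode(ch)
            node.children.append(child)
            return child
        return None

    def longest_match(self, text, start):
        """End index of the longest dictionary word starting at `start`,
        or start + 1 if nothing matches (treated as an unknown character)."""
        node, end = self.root, start + 1
        for i in range(start, len(text)):
            node = self._child(node, text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end

def segment(text, trie):
    """Greedy longest-match segmentation over the whole text."""
    tokens, pos = [], 0
    while pos < len(text):
        end = trie.longest_match(text, pos)
        tokens.append(text[pos:end])
        pos = end
    return tokens

# Hypothetical usage (frequencies are made up for the example):
trie = RankedTrie({"word": 50, "segmentation": 10, "segment": 30})
print(segment("wordsegmentation", trie))   # ['word', 'segmentation']
```

Because each child list is scanned from front to back, ordering the list by usage frequency means that common continuations are found after fewer comparisons, which is the effect TLS-ART exploits.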
The remainder of this article is organized as follows. Section 2 presents an overview of research and problems related to NLP. Section 3 covers natural language processing concepts. Section 4 presents TLS-ART in detail. Section 5 shows the experimental results. Finally, Section 6 gives concluding remarks as well as future research directions.
2 Related Work and Existing Problems
2.1 Research in NLP
Recently, many researchers have studied NLP in many different ways. For example, the researchers in [4] used NLP to analyze software requirements specifications written in unrestricted (unfixed-pattern) natural language and translate them into a formally defined language. In addition, automatic document indexing based on the number of occurrences of each substring in the document and a tree structure was studied in [15], and [11] proposed a technique to classify legal contract agreement documents by using NLP. Moreover, NLP interfaces to computer systems that retrieve information from databases have been studied using various techniques and algorithms, such as LUNAR [16], FREyA [8], and NLKBIDB [10]. These techniques cover natural language in both simple sentences and negation sentences, including words such as "outside", "exclude", "does not", "not", "no", and so on.
2.2 Existing Problems
Androutsopoulos et al. [1] mentioned that key factors in the development of NLP are expertise in linguistics and the specialization of research work; a lack of such expertise would hinder the progress of NLP research and development. This is consistent with Rodolfo et al. [9], who mentioned four major problems that often occur when using Natural Language Interfaces to Databases: 1) the various grammatical forms of natural language; 2) the omission of important words that convey the meaning of a sentence; 3) queries for information that spans many tables and uses aggregate functions; and 4) problems caused by human errors.
3 Methods and Techniques of NLP
For more than 40 years, NLP have been conducted in many researches to facil-
itate computer utilization by using numerous methods and techniques. Main
processes in NLP can be divided into four steps [10] as shown in Figure 1.
Fig. 1: Natural Language Processing steps
1. Lexical Analysis: This step analyses natural language sentences by splitting them into small items, each called a token. In addition, the tokens are assigned types and some essential information to be used in the next step.
2. Syntactic Analysis: In this step, all tokens are parsed against predefined sentence structures (syntax) to check validity and to provide information to be used in the meaning-analysis process.
3. Semantic Analysis: The semantic analysis process interprets the meaning of a sentence by matching the information derived from the previous step against a semantic structure, such as an ontology or a semantic-web structure, to produce data that represent the meaning of the sentence.
4. Output Transformation: This step transforms the outputs derived from Semantic Analysis into results that meet the objectives of the target task, such as SQL commands for information retrieval from databases; a toy sketch of this four-step flow is given below.
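As a rough illustration of how the four steps feed into one another, the toy pipeline below turns a simple English request into an SQL command. All function names, the tiny token-type table, the syntax pattern, and the one-entry ontology are hypothetical and serve only to show the data flow; they are not taken from the paper.

```python
# A toy end-to-end NLP pipeline mirroring the four steps above.
# All names and the tiny rule sets are hypothetical, for illustration only.

def lexical_analysis(sentence):
    # Step 1: split the sentence into typed tokens.
    types = {"show": "VERB", "all": "DET", "orders": "NOUN"}
    return [(w, types.get(w, "UNKNOWN")) for w in sentence.lower().split()]

def syntactic_analysis(tokens):
    # Step 2: check the token sequence against a predefined pattern.
    pattern = ["VERB", "DET", "NOUN"]
    if [t for _, t in tokens] != pattern:
        raise ValueError("sentence does not match the expected syntax")
    return {"action": tokens[0][0], "object": tokens[2][0]}

def semantic_analysis(parse):
    # Step 3: map the parse onto a tiny "ontology" of known database tables.
    ontology = {"orders": "orders_table"}
    return {"operation": "SELECT", "table": ontology[parse["object"]]}

def output_transformation(meaning):
    # Step 4: produce the target output, here an SQL command.
    return f"SELECT * FROM {meaning['table']};"

print(output_transformation(semantic_analysis(
    syntactic_analysis(lexical_analysis("show all orders")))))
# -> SELECT * FROM orders_table;
```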
As mentioned before, since Thai is a non-segmented Natural Language, Word Segmentation in lexical analysis is a very important process because of the difficulty of splitting words out of sentences. If the analysis is not effective enough, it produces wrong results, and the subsequent processes, i.e., syntactic analysis and semantic analysis, will inevitably produce wrong output too. At present, some lexical analysis systems are available for public use. For example, WordNet is a system that analyzes English words with a large online database. For Thai, LexTo, a program developed by the National Electronics and Computer Technology Center (NECTEC), is widely used for Thai Word Segmentation.
4 The Proposed TLS-ART
The main idea of this research is to improve the efficiency of the dictionary-based method by excluding excessive words from the dictionary and reorganizing the words in the Trie. The proposed technique, Thai Language Segmentation by Automatic Ranking Trie (TLS-ART), employs actual usage frequency to reduce the dictionary size and the number of matching operations needed to split and identify words in sentences. Figure 2 shows the flowchart of the proposed TLS-ART. The proposed technique consists of the following steps, as shown in Figure 3.
Fig. 2: TLS-ART flowchart

Fig. 3: TLS-ART Research Steps

1. Dataset Preparation: In this step, the data sets used to build the dictionary and Trie are prepared. The dataset used in this research is a set of Thai-language sample text files collected from the actual daily-life usage of Thai people. The sample text files were created from sentences randomly collected from popular websites and conversation dialog chats covering all major fields, including economics, social, political, entertainment, and others. In total, 48 websites and 1,320 files were collected, as shown in Table 1. As Table 1 shows, the dataset was divided into two sets: a Training Set used for the Word Usage Analysis process and a Test Set used for the TLS-ART Evaluation process.

Table 1: Number of sample websites and text files in each category of the dataset

Category        No. Websites   No. Training Files   No. Test Files
Economics       8              200                  20
Social          8              200                  20
Political       8              200                  20
Entertainment   8              200                  20
Chat room       8              200                  20
Others          8              200                  20
Total           48             1,200                120
2. Word Usage Analysis: This process analyses the texts in the Training Dataset and counts the number of appearances of each word to obtain its Word Usage Frequency (WUF).
3. Create dictionary and Trie: This process saves each word and its WUF to the dictionary and builds the Trie by inserting words in order of usage frequency, from high to low (a sketch of steps 2 and 3 follows this list).
4. TLS-ART Processing and Evaluation: The main task of TLS-ART processing is to parse the input text files from the Test Dataset with the Trie, comparing character by character to find the longest matching word and counting the number of appearances of each word, which is used for improving the Trie and the dictionary. When no matching word is found, the unknown string is shown to users for verification and added as a new word to the dictionary and Trie. To prove its effectiveness, TLS-ART is compared with LexTo using the Test Dataset described in Table 1, consisting of text files randomly collected from popular websites in all categories.
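Steps 2 and 3 amount to counting word occurrences over the training files and then rebuilding the dictionary and trie in descending order of frequency. The sketch below, which reuses the RankedTrie and segment functions from the earlier sketch, shows one way this could look; the directory layout, the seed trie used for the initial counting pass, and the minimum-frequency threshold are assumptions for illustration, not details from the paper.

```python
# Sketch of Word Usage Analysis (step 2) and ranked-trie construction (step 3).
# File paths, the minimum-frequency threshold, and the use of a seed word list
# are illustrative assumptions, not taken from the paper.

from collections import Counter
from pathlib import Path

def word_usage_frequency(training_dir, seed_trie):
    """Count how often each dictionary word occurs in the training files."""
    wuf = Counter()
    for path in Path(training_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        for token in segment(text, seed_trie):   # segment() from the earlier sketch
            wuf[token] += 1
    return wuf

def build_ranked_resources(wuf, min_freq=1):
    """Drop words never (or rarely) used, then build the frequency-ranked trie."""
    kept = {w: f for w, f in wuf.items() if f >= min_freq}
    dictionary = dict(sorted(kept.items(), key=lambda kv: -kv[1]))
    return dictionary, RankedTrie(dictionary)

# Hypothetical usage:
# wuf = word_usage_frequency("training_files/", seed_trie)
# dictionary, trie = build_ranked_resources(wuf, min_freq=2)
```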
Table 2: Dictionary size and number of link searches used by TLS-ART compared with LexTo
Dictionary No. words No. link searches
LexTo 42,222 1,060,514
TLS-ART 5,881 925,467
Decrease 86.07% 12.73%
Table 3: Performance Evaluation
Techniques Accuracy Precision Recall F-measure
LexTo 0.935 0.957 0.976 0.967
TLS-ART 0.936 0.958 0.976 0.967
5 Experimental Results
Experiments comparing word segmentation using TLS-ART and LexTo show that TLS-ART reduces the dictionary size from 42,222 words to 5,881 words, an 86.07% reduction. In addition, the number of link searches used by TLS-ART is 925,467, which is 12.73% less than LexTo, while the accuracy, precision, recall, and F-measure values are nearly equal, as shown in Tables 2 and 3.
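For reference, one common way to compute word-level precision, recall, and F-measure for segmentation output is to match predicted and reference words by their character spans, as in the sketch below. This is a conventional definition used for illustration; the paper does not spell out its exact evaluation formulas, so treat it as an assumption.

```python
# Sketch of word-level evaluation, matching predicted words to reference words
# by their character spans. A common convention, not necessarily the exact
# metric definition used in the paper.

def spans(tokens):
    out, pos = set(), 0
    for t in tokens:
        out.add((pos, pos + len(t)))
        pos += len(t)
    return out

def evaluate(predicted_tokens, reference_tokens):
    pred, ref = spans(predicted_tokens), spans(reference_tokens)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical example:
print(evaluate(["word", "segment", "ation"], ["word", "segmentation"]))
# -> (0.333..., 0.5, 0.4)
```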
6 Conclusion, Discussion and Future Work
To reduce the matching task and improve the efficiency of Thai segmentation, Thai Language Segmentation using Automatic Ranking Trie (TLS-ART) is proposed in this research. Word Usage Frequency (WUF) is used to exclude unused words from the dictionary and to reorganize the words in the trie structure. Experimental results showed that TLS-ART can significantly reduce the size of the dictionary. Moreover, reorganizing the words in the Trie by WUF clearly reduces the number of link searches.
From this study, it is observed that almost all unknown words are specific names of people, places, and things, together with other rarely used vocabulary. Moreover, some specific names may include other words as their parts. Segmenting and identifying these unknown words correctly is a challenging issue for future research.
References
[1] Androutsopoulos, I., Ritchie, G.D., Thanisch, P.: Natural Language Interfaces to Databases - An Introduction, Natural Language Engineering, 1, 1, pp. 29–81, 1995
[2] Al-Suwaiyel, M., Horowitz, E.: Algorithms for trie compaction, ACM
Trans. Database Syst, 9, 2, pp. 243–263, 1984
[3] Chaloenpomsawat, P.: Feature-Based Thai Word Segmentation, Master Thesis, Chulalongkorn University, 1998
[4] Fatwanto, A.: Software Requirements Specification Analysis Using Natu-
ral Language Processing Technique, IEEE Quality in Research, 2013
[5] Fellbaum, C.: WordNet and wordnets, In: Brown, K. et al. (eds.), Encyclo-
pedia of Language and Linguistics, 2nd ed., Oxford: Elsevier, pp. 665–670,
2005 [http://wordnetweb.princeton.edu/perl/webwn]
[6] Haruechaiyasak, C., Kongyoung, S. and Dailey, M.: A Comparative Study
on Thai Word Segmentation Approaches, In: IEEE Proceedings of 5th Inter-
national Conference on ECTI-CON 2008, pp. 125–128, 2008
[7] Henasanangul, T., Seresangtakul, P.: Thai Text with Unknown Word Seg-
mentation Using the Word Identification, Khonkaen University Research
Journal, 6, 2, pp. 48–57, 2006
[8] Llopis, M., Ferrández, A.: How to make a natural language interface to query datasets accessible to everyone: An example, Computer Standards and Interfaces, 2012
[9] Rodolfo, A., Pazos, R., Juan, J., Gonzalez, B., Marco, A., Aguirre, L.: Se-
mantic Model for Improving the Performance of Natural Language In-
terfaces to Databases, In: Advances in Artificial Intelligence: 10th Mexican
International Conference on Artificial Intelligence, MICAI 2011, Vol. 7094, pp.
277-290, 2011
[10] Shah, A., Pareek, J., Patel, H., Panchal, N.: NLKBIDB - Natural language
and keyword based interface to database, In: IEEE International Conference
on Advances in Computing, Communications and Informatics (ICACCI 2013),
pp. 1569–1576, 2013
[11] Slankas, J., Williams, L.: Classifying Natural Language Sentences for Pol-
icy. IEEE International Symposium the Policies for Distributed Systems and
Networks (POLICY), pp. 34–36, 2012
[12] Song, P., Shu, A., Phipps, D.: Language Without Words: A Pointillist
Model for Natural Language Processing, SCIS-ISIS 2012, Kobe, Japan,
2012
[13] Suwannawach, P.: Thai Word Segmentation Improvement using Maximum Matching and Tri-gram Technique, Master Thesis, King Mongkut's Institute
of Technology Ladkrabang, 2012
[14] Tepdang, S., Haruechaiyasak, C., Kongkachandra, R.: Improving Thai
word segmentation with Named Entity Recognition. In: International
Symposium on Communications and Information Technologies (ISCIT), 2010
[15] Todsanai, C.: An automatic indexing technique for Thai texts using fre-
quent max substring, In: IEEE Eighth International Symposium on Natural
Language Processing, 2009
[16] Woods, W.A., Kaplan, R.M., Webber, B.L.: The Lunar Sciences Natural
Language Information System, Final Report, BBN Report 2378, Cambridge,
Massachusetts: Bolt Beranek and Newman Inc., 1972