Journal of ICT, 12, 2013, pp: 147–159
NORMALIZATION OF NOISY TEXTS IN MALAYSIAN
ONLINE REVIEWS
Norlela Samsudin1, Mazidah Puteh2, Abdul Razak Hamdan3
and Mohd Zakree Ahmad Nazri4
1&2Faculty of Computer and Mathematical Science,
Universiti Teknologi MARA Terengganu,
Dungun, 23000, Terengganu, Malaysia
3&4Faculty of Information Science and Technology,
Universiti Kebangsaan Malaysia
43600, Bangi, Selangor, Malaysia
Corresponding author: norlela@tganu.uitm.edu.my1
ABSTRACT
The process of gathering useful information from online messages has grown as more and more people use the Internet and online applications such as Facebook and Twitter to communicate with each other. One of the problems in processing online messages is the high number of noisy texts that exist in these messages. Several studies have shown that noisy texts degrade the results of text mining activities. On the other hand, very few works have investigated the patterns of noisy texts created by Malaysians. In this study, a common noisy terms list and an artificial abbreviations list were created using specific rules and were used to select candidate correct words for a noisy term. The correct term was then selected based on a bi-gram word index. The experiments used online messages created by Malaysians. The results show that normalization of noisy texts using the artificial abbreviations list complements the use of the common noisy texts list.
Keywords: Noisy texts; normalization of noisy texts; artificial abbreviation
INTRODUCTION
The advancement of Internet technology has produced a massive collection of online documents from applications such as e-forums, blogs, Facebook and Twitter.
http://jict.uum.edu.my/
Online social media allow users to communicate with each other in an informal environment. Therefore, online documents are filled with out-of-vocabulary (OOV) terms, or noisy texts, and do not follow the usual structure of a language. Knoblock, Lopresti, Roy and Subramaniam (2007) define noisy text as "any kind of difference between the surface form of a coded representation of the text and the intended, correct, or original text". Despite being noisy, online documents contain important information such as opinions about a particular product, service or political figure. In addition, customers often give feedback or comments about an organization through online facilities. Mining online documents may reveal information that is vital for the survival of a company. Frequently Asked Questions (FAQ) is another application that receives input from customers via online applications. Unfortunately, the noisy texts in online messages lead to inaccurate information in text processing activities. Therefore, online documents must be processed before any information gathering activity is executed on them. The following is an example of a typical e-forum entry created by a Malaysian:
budak kecik ni asyik sangat tengok 7 petala cinta. br lps tgk
citer ni (-____-)……. best citer ni!!! bc komen2 kt sini...yg
mana lost2 boleh faham balik... http://asdkfj.kasdf.dfjk.my “.
This message is filled with incorrect sentence structure, improper casing, incorrect punctuation, misspelled words, a mix of terms from different languages and creative use of emoticons. Work by Samsudin, Puteh and Hamdan (2011) and Dey and Haque (2009) showed that the occurrence of noisy texts reduced the accuracy of opinion mining. Similarly, Vinciarelli (2004) concluded that noisy text also affects text mining activities. In addition, experiments by Tang, Li, Cao, and Tang (2005) concluded that term extraction from electronic mail improved by 35% to 45% after the emails had been cleaned of noisy terms.
Previous research on the normalization of noisy texts mainly used English-language resources such as:
1) a standard parser, as used by Clark (2003), Foster, Wagner, and Genabith (2008), and Jing, Lopresti, and Shih (2003);
2) resources from the World Wide Web, in Wong, Liu, and Bennamoun (2006);
3) English dictionaries, in Wong, Leu, and Bennamoun (2006) and Dey and Haque (2008); or
4) specific domain dictionaries, as used by Kothari, Negi, Faruquie, Chakaravarthy, and Subramaniam (2009).
Unfortunately, no such reference is available for the Malay language. In addition, most previous works try to solve noisy terms created from their phonetic sounds, such as 'cu', 2u, 2morrow, l8, or lol. Malaysians rarely use these terms. This study shows that the top five noisy terms commonly used by Malaysians in online documents are tu (itu), yg (yang), ni (ini), tak (tidak) and x (tidak). The shorter version of a term, or abbreviation, is used to reduce key punching (especially when a mobile device is used to create the message) and to speed up the communication process. This project studied the patterns of abbreviations that Malaysians use in online media and created an artificial abbreviations list to improve the normalization of noisy texts. In addition, a list of common noisy texts that Malaysians normally use in online messages was also created and used in the normalization process.
BACKGROUND
Kobus (2008) identified three metaphors for cleaning noisy texts, i.e. the spell checking metaphor, the translation metaphor and the speech recognition metaphor. The spell checking metaphor assumes that all out-of-dictionary words are noisy terms that need to be corrected. This technique normally uses a specific dictionary to identify a noisy term. Most works on the normalization of noisy texts adopt this metaphor, such as those by Toutanova and Moore (2002), Wong, Leu et al. (2006), Choudhury et al. (2007) and Cook and Stevenson (2009). Nevertheless, this method does not consider the context in which a term is used. The second metaphor treats text containing noisy terms as another language and uses a specific dictionary to translate it into correct text. Researchers normally apply statistical techniques to this problem, such as the phrase-based statistical model of Aw, Zhang, Xiao and Su (2006) and the Hidden Markov Models of Choudhury et al. (2007) and Acharyya, Negi, Subramaniam, and Roy (2008). The last metaphor is based on work that converts speech notation into text. Users of online communication normally communicate in an informal manner. The use of text that imitates the phonetic sound of a word, such as fon (phone), 2nite (tonight) or cite (cerita), is common in online communication. This method uses predefined codes that translate phonetic-sound spellings into written-text spellings based on specific rules (Kobus, 2008).
One of the trends in online messages is the use of shortened words in the form of acronyms or abbreviations. An acronym is a word formed by combining the initial letters of a group of words, such as UUM (Universiti Utara Malaysia), AF (Akademi Fantasia) and lol (laugh out loud). On the other hand, an abbreviation is a shortened form of a word, such as gd (good), bst (best) and kg (kampong). The constraints of the device, due to the use of mobile phones as a medium of communication, and constraints of time cause online users to shorten the spelling of texts in online messages. Several trends in how Malaysians shorten Malay terms have been identified in Hussin (2009) and Dewan Bahasa dan Pustaka (2008). This paper investigates the use of a common noisy terms list and an artificial abbreviations list to normalize noisy texts. To the knowledge of the writers, this work is the first attempt to normalize online messages written by Malaysians.
METHODOLOGY
Preparation of data
The experiment requires a collection of online messages created by Malaysians. In order to create this collection, 5000 e-forum entries, 5000 Twitter messages and 5000 Facebook messages believed to have been created by Malaysians were manually extracted. As shown in Figure 1, the following lists were created from these messages:
a) A list of noisy terms that occur more than three times in these documents. About 4000 noisy terms were identified and manually translated. This list is known as NTTranslate.
b) A list of all correctly spelled words other than proper names. Items from this list were merged with the translations of noisy texts from list (a). A total of 10,550 words are listed. This list is named CommonWords.
c) The correctly spelled words from (b) merged with a list of Malay words taken from a digital dictionary. This list is known as the CorrectWords list and is used in the project to identify out-of-vocabulary (OOV) terms.
d) The online documents were semi-automatically translated and verified. A list that records the frequency of bi-gram words in the corpus was created and used to select the most suitable term as the translation for a noisy text. This list is known as the Bi-Gram Index.
Another 100 online messages were extracted as testing data. Noisy texts were tagged and translated manually. Other terms were tagged as correct words, numbers, icons, links and symbols. These data were used to check the effectiveness of the normalization processes in this study.
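The list-building and OOV-detection steps above can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the word lists and messages below are made-up stand-ins for the NTTranslate, CommonWords, CorrectWords and Bi-Gram Index lists.

```python
from collections import Counter

# Illustrative stand-ins for the lists described in steps (b)-(d); the
# real CorrectWords list merges CommonWords with a Malay dictionary.
dictionary_words = {"itu", "yang", "ini", "tidak", "cerita", "tengok", "best"}
common_words = {"best", "komen", "faham", "balik"}
correct_words = dictionary_words | common_words  # CorrectWords list

def tokenize(message):
    return message.lower().split()

def detect_oov(message):
    """Flag every token missing from the CorrectWords list as a noisy term."""
    return [tok for tok in tokenize(message) if tok not in correct_words]

def build_bigram_index(corpus):
    """Count word bi-grams over the translated corpus (the Bi-Gram Index)."""
    index = Counter()
    for message in corpus:
        tokens = tokenize(message)
        index.update(zip(tokens, tokens[1:]))
    return index

corpus = ["tengok cerita itu", "cerita itu best"]
bigram_index = build_bigram_index(corpus)
print(bigram_index[("cerita", "itu")])     # 2
print(detect_oov("br tengok cerita itu"))  # ['br']
```

In the experiments, the index is consulted with the word preceding a noisy term, so that the most frequent bi-gram decides among competing translations.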
Figure 1. Processing of online messages corpus.
Generating the artificial abbreviations list
A Malay term is made up of several syllables. A syllable is the smallest unit of a speech sound; normally it is made from a combination of a vowel and consonants. For example, the word 'kucing' is a combination of two syllables, i.e. 'ku' and 'cing'. In addition to the normal consonant characters, the Malay language also adopts consonant clusters, i.e. gh, kh, ny, ng, sy. The rules for creating artificial abbreviations manipulate the characters and syllables of a particular word. In 2008, a guideline for creating SMS abbreviations in the Malay language was published by Dewan Bahasa dan Pustaka. Adopting these rules and observations of the abbreviation patterns of the top 200 noisy texts, a list of artificial noisy texts was created. The rules that manipulate characters are:
1. Remove all vowels, as in sklh (sekolah) and slr (seluar).
2. Use the first character and the last character if neither of them is a vowel, such as yg (yang) and kg (kampong).
3. Replace the last character with the character 'e' if it is an 'a', such as ape (apa) and berape (berapa).
4. Add the character 'k' to the end of the word if the word ends with the character 'a', such as bapak (bapa) and mintak (minta).
5. Drop the first vowel if the word starts with a consonant, such as sapa (siapa) and slalu (selalu).
6. Drop the last vowel if the last character is not a vowel, such as ank (anak) and ingt (ingat).
7. Use the first and the last characters, such as pi (pergi) and dn (dan).
8. If the term ends with 'ar', replace it with the character 'o', such as sabo (sabar) and terbako (terbakar).
9. If the term starts with 'ha', drop the character 'h', such as antu (hantu) and ari (hari).
10. Using a single character to replace a word with a similar phonetic sound is also common. The following abbreviations are also added to the list: w (why), x (tidak), n (dan), g (pergi), s (as), d (di), k (ok), u (you), t (nanti).
The following rules manipulate the syllables of a word.
1. Use the first syllable, such as sem (semester).
2. Use the last syllable, such as mak (emak) or ngan (dengan). If the new last syllable ends with an 'a', replace it with 'e', such as je (sahaja) or te (kita). In addition, if the word ends with an 'a', add the character 'k', such as gak (juga).
3. Use the first character of each syllable in a word, such as spt (seperti). If a syllable starts with a consonant cluster, use its second character instead, such as tgk (tengok).
In addition to the rules listed previously, the following rules, which manipulate both syllables and characters, are also adopted.
1. Use the first character of each syllable plus the last character (if it is a consonant), such as byk (banyak) and tgh (tengah).
2. Use the first character and the last syllable, such as bleh (boleh) or bru (baru). If the new term ends with an 'a', replace it with the character 'e', such as bpe (berapa) and mne (mana).
3. Use the last syllable, but replace its first character with the first character of the word, such as tak (tidak) and tgok (tengok).
Using the CommonWords list, about 80,000 artificial noisy texts were generated and named the Artificial Abbreviation list.
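A few of the character rules above (rules 1, 2, 3 and 5) can be sketched as a small generator. This is a hedged illustration rather than the authors' implementation, and the sample vocabulary is only for demonstration:

```python
VOWELS = set("aeiou")

def artificial_abbreviations(word):
    """Generate candidate abbreviations for one word, covering a subset
    of the character-manipulation rules (rules 1, 2, 3 and 5)."""
    forms = set()
    # Rule 1: remove all vowels, e.g. sekolah -> sklh
    forms.add("".join(ch for ch in word if ch not in VOWELS))
    # Rule 2: first + last character when neither is a vowel, e.g. yang -> yg
    if word[0] not in VOWELS and word[-1] not in VOWELS:
        forms.add(word[0] + word[-1])
    # Rule 3: replace a final 'a' with 'e', e.g. apa -> ape
    if word.endswith("a"):
        forms.add(word[:-1] + "e")
    # Rule 5: drop the first vowel when the word starts with a consonant,
    # e.g. selalu -> slalu
    if word[0] not in VOWELS:
        for i, ch in enumerate(word):
            if ch in VOWELS:
                forms.add(word[:i] + word[i + 1:])
                break
    forms.discard(word)  # the word itself is not an abbreviation
    return forms

# Inverting the generated forms yields the Artificial Abbreviation list:
# abbreviation -> set of possible source words.
abbrev_list = {}
for w in ["sekolah", "yang", "apa", "selalu"]:
    for form in artificial_abbreviations(w):
        abbrev_list.setdefault(form, set()).add(w)

print(sorted(abbrev_list["yg"]))  # ['yang']
```

Inverting the rules this way, over the full CommonWords list, is what produces the abbreviation-to-candidates mapping that the normalization step consults.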
Normalization process
Three experiments were conducted in this project. The first experiment is considered the base experiment, in which normalization of noisy texts was executed using the common noisy texts translations. If more than one translation was identified, the correct term was selected at random. Figure 2 illustrates the process.
Kukich (1992) suggests three stages in the normalization of noisy texts, namely Detect Noisy Terms, Identify Candidates and Choose Translation. In order to identify a noisy term, a word is compared against a dictionary of correct words; all words not in the dictionary are considered noisy terms. The next step identifies candidate correct words using the artificial abbreviations list created with the rules explained in the previous section. The last step chooses the correct term based on the context in which the word is used, by comparing the frequency of each candidate's bi-gram with the previous word. These steps make up the second experiment, as illustrated in Figure 3.
Figure 2. Using common NT list.
Figure 3. Normalization of noisy text using artificial abbreviations.
In the third experiment, in addition to the artificial abbreviations, the common noisy terms list was added as one of the references for identifying correct term candidates, as illustrated in Figure 4.
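As a rough sketch (not the authors' code), the three-stage pipeline of the second and third experiments might look as follows, assuming the CorrectWords, Artificial Abbreviation, NTTranslate and Bi-Gram Index lists built earlier; all sample entries are illustrative stand-ins:

```python
from collections import Counter

# Illustrative stand-ins for the reference lists described in the paper.
correct_words = {"tengok", "cerita", "yang", "ini", "itu"}
abbrev_list = {"ni": {"ini", "nanti"}}          # Artificial Abbreviation
nt_translate = {"yg": {"yang"}}                 # common noisy terms list
bigram_index = Counter({("cerita", "ini"): 5,   # Bi-Gram Index counts
                        ("cerita", "nanti"): 1})

def normalize(tokens):
    out = []
    for tok in tokens:
        # Stage 1: detect -- any out-of-dictionary token is a noisy term.
        if tok in correct_words:
            out.append(tok)
            continue
        # Stage 2: identify candidates from both reference lists
        # (the third experiment; the second uses abbrev_list only).
        candidates = abbrev_list.get(tok, set()) | nt_translate.get(tok, set())
        if not candidates:
            out.append(tok)  # no translation found; keep the token as-is
            continue
        # Stage 3: choose the candidate whose bi-gram with the previous
        # word occurs most often in the corpus.
        prev = out[-1] if out else ""
        out.append(max(candidates, key=lambda c: bigram_index[(prev, c)]))
    return out

print(normalize(["yg", "tengok", "cerita", "ni"]))
# ['yang', 'tengok', 'cerita', 'ini']
```

Note how "ni" is resolved to "ini" rather than "nanti" only because the bi-gram ("cerita", "ini") is more frequent, which is the contextual selection the Bi-Gram Index provides.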
Figure 4. Normalization of noisy text using common noisy text and artificial abbreviation lists.
RESULTS AND DISCUSSION
The purpose of this study is to check whether the artificial abbreviations list and the common noisy text translations can improve the process of 'cleaning' noisy terms in online media messages created by Malaysians. 100 online messages stating opinions about a particular movie were extracted from various e-forums, Facebook entries and Twitter messages. These messages contain between 11 and 170 words, with an average of 60 words per message. On average, 15 noisy texts were identified manually in every message. Surprisingly, the system identified an average of 17 noisy texts per message. This is due to the use of English words that were not listed in the CorrectWords list, which was created from the common words in 15,000 online messages and a Malay dictionary. Therefore, words such as predictable, private and characters were considered noisy terms, since they did not exist in the CorrectWords list. In the researchers' opinion, the English terms in the CommonWords list should have been enough to identify the common English words used by Malaysians in online messages. Unfortunately, that was not the case, as shown by the 2% increase in noisy texts identified by the system. In addition, a proper noun, such as the name of a person or a movie spelled without an upper-case first letter, was also considered a noisy term. Therefore, the number of noisy texts identified in every experiment was higher than the number identified manually. A correctly identified noisy text is one that was identified and translated exactly as in the manual process. An incorrectly identified noisy text is a word that was not considered a noisy text in the manual process, or a word that was identified as a noisy text but translated wrongly. Table 1 shows the average percentage of correctly identified and incorrectly identified noisy texts captured in each experiment.
Table 1

Results of Experiments

                            NTTranslate   Artificial      NTTranslate +
                                          Abbreviation    Artificial Abbreviation
Correctly Identified NT     70%           42%             76%
Incorrectly Identified NT   40%           58%             34%
The results show that 70% of the noisy texts identified in the messages could be corrected using the common noisy texts list. On the other hand, only 42% of noisy texts could be corrected using the artificial abbreviations list alone. Nevertheless, the result improved when both lists were used. NTTranslate is a list of manual noisy text translations extracted from 15,000 online media messages created by Malaysians. Common noisy texts were therefore captured in this list, which produced a better result in the normalization of noisy texts than using the artificial noisy text list alone. Unfortunately, using only the common noisy terms list had several setbacks. It failed to capture the relation between a word and its previous word. Neither could it identify creative short forms of a word that were not commonly used. In addition, the processing of noisy terms that contain a number is limited to the existence of the word in the common noisy text list. These problems were tackled when the artificial abbreviations list was used. Even though the artificial abbreviations list solves the above problems, it cannot recognize noisy terms that rely on phonetic similarity, such as 2CU (to see you), pilem (filem), citer (cerita) and siyes (say yes). Other than phonetic sounds, the approach in this study also ignores the following types of noisy texts:
a) Abbreviations made from a combination of two or more words, such as dorg (dia orang) and pastu (selepas itu).
b) Proper names, such as the name of a person or a country. Currently, the algorithm assumes that words starting with a capital letter are proper names and hence does not process them. On the other hand, a proper name that starts with a lower-case letter is assumed to be noisy text.
c) Words with double meanings. For example, sapa is a root word used in words such as menyapa or disapa. It is considered a correct word and exists in the dictionary. Nevertheless, when it is used in a different context, such as "kubur sapa ni?", the word sapa is a noisy term; the correct word is siapa. This situation was not identified or corrected in the normalization process.
d) Slang words such as ma, je, jee, le, bah, gezek and lu. In addition, terms that indicate expressions, such as augh, err, haha, and hehehehe, are also ignored. These words are counted as correctly identified noisy terms.
Other reasons for incorrect translation are:
• Typing errors, such as the word tima in the phrase 'tima aku tengok', which was supposed to be 'time'. Since 'tima' is not a common noisy term, it is not listed in NTTranslate, but it is listed in the abbreviations list as the short form of the word 'terima'.
• Noisy texts derived from words that are not listed in the digital dictionary, such as sgtle, which means sangatlah. Such words occur because of additional suffixes added by the users.
• Creative words that are out of the norm and do not follow the usual patterns of noisy term creation, such as ritu (hari itu), pes (peace) and asik (asyik).
CONCLUSION
This study showed that a common noisy texts list and an artificial abbreviations list were effective in the normalization of noisy terms in Malaysian online messages. Both lists are the main contributions of this study. The common noisy texts list is a list of noisy texts that occurred three times or more in 15,000 online messages created by Malaysians. Nevertheless, the noisy terms used by online users vary with the environment or domain of the subject. Therefore, the artificial abbreviations list complements the common noisy texts list and produces a better result in the normalization process. The artificial abbreviations list is created by projecting noisy terms that users may use, based on several common patterns of short forms observed by the researchers.
At the end of the study, the researchers believed that incorporating other modules could improve the result of noisy text normalization. Among them are:
1) using an English dictionary in addition to the Malay dictionary to identify OOV words;
2) incorporating a technique to check for noisy terms that result from the use of suffixes and prefixes on Malay words;
3) incorporating a technique to handle words spelled according to their phonetic sound rather than their standard spelling; and
4) incorporating a list of slang words and words that convey expressions, such as arg, oh, zzzz, and hurg.
These modules are possible enhancements for future research. As more and more people use the Internet and other online applications to communicate with each other, the need to process online text messages will also increase. The noisy texts incorporated in these messages need to be normalized so that other text processing applications, such as Q&A, customer service, classification and information retrieval, may produce useful and accurate information. The common noisy text list and the artificial abbreviations list are two references that may be utilized in the noisy text normalization process for messages created by Malaysians.
ACKNOWLEDGEMENT
This research is supported by the Fundamental Research Grant Scheme
(FRGS) under the ninth Malaysia Plan (RMK-9), Ministry of Higher
Education (MOHE) Malaysia. The grant number is 600-RMI/ST/FRGS 5/3/
Fst (208/2010).
REFERENCES
Acharyya, S., Negi, S., Subramaniam, L. V., & Roy, S. (2008). Unsupervised
learning of multilingual short message service (SMS) dialect from noisy
examples. Paper presented at the Second Workshop on Analytics for
Noisy Unstructured Text Data, Singapore.
Aw, A., Zhang, M., Xiao, J., & Su, J. (2006). A phrase-based statistical model for SMS text normalization. Paper presented at the COLING/ACL Main Conference Poster Sessions, Sydney, Australia.
Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S., & Basu, A.
(2007). Investigation and modeling of the structure of texting language.
International Journal on Document Analysis and Recognition, 10(3),
157-174. doi: 10.1007/s10032-007-0054-0
Clark, A. (2003). Pre-processing very noisy text. Paper presented at the Workshop on Shallow Processing of Large Corpora, Lancaster.
Cook, P., & Stevenson, S. (2009). An unsupervised model for text message
normalization. Paper presented at the Workshop on Computational
Approaches to Linguistic Creativity, Boulder, Colorado.
Dey, L., & Haque, S. K. M. (2008). Opinion mining from noisy text data. Paper
presented at the Second Workshop on Analytics for Noisy Unstructured
Text Data. Singapore.
Dey, L., & Haque, S. K. M. (2009). Studying the effects of noisy text on
text mining applications. Paper presented at the Third Workshop on
Analytics for Noisy Unstructured Text Data, Barcelona, Spain.
Foster, J., Wagner, J., & Genabith, J. V. (2008). Adapting a WSJ-trained
parser to grammatically noisy text. Paper presented at the 46th Annual
Meeting of the Association for Computational Linguistics on Human
Language Technologies: Short Papers, Columbus, Ohio.
Hussin, S. (2009). Bahasa SMS. Retrieved from http://supyanhussin.
wordpress.com/2009/07/11/bahasa-sms/
Jing, H., Lopresti, D., & Shih, C. (2003). Summarization of noisy documents: A pilot study. Paper presented at the HLT-NAACL 03 Text Summarization Workshop - Volume 5.
Kobus, C., Yvon, F., & Damnati, G. (2008). Normalizing SMS: Are two metaphors better than one? Paper presented at the 22nd International Conference on Computational Linguistics, Manchester.
Kothari, G., Negi, S., Faruquie, T. A., Chakaravarthy, V. T., & Subramaniam, L. V. (2009). SMS based interface for FAQ retrieval. Paper presented at the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, Suntec, Singapore.
Dewan Bahasa dan Pustaka. (2008). Panduan Singkatan Khidmat Pesanan Ringkas. Retrieved from http://www.dbp.gov.my/khidmatsms.pdf
Samsudin, N., Puteh, M., & Hamdan, A. R. (2011). Bess or xbest: Mining the
Malaysian online reviews. Paper presented at the 3rd Conference on
Data Mining and Optimization (DMO).
Tang, J., Li, H., Cao, Y., & Tang, Z. (2005). Email data cleaning. Paper presented at the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA.
Toutanova, K., & Moore, R. C. (2002). Pronunciation modeling for improved spelling correction. Paper presented at the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania.
Vinciarelli, A. (2004, August 23-26). Noisy text categorization. Paper presented at the 17th International Conference on Pattern Recognition (ICPR).
Wong, W., Leu, W., & Bennamoun, M. (2006). Integrated scoring for spelling
error correction, abbreviation expansion and case restoration in dirty
text. Paper presented at the Australasian Data Mining Conference,
Sydney.
Wong, W., Liu, W., & Bennamoun, M. (2006). Integrated scoring for spelling error correction, abbreviation expansion and case restoration in dirty text. Paper presented at the Fifth Australasian Conference on Data Mining and Analytics - Volume 61, Sydney, Australia.
http://jict.uum.edu.my/
... The remainder of this paper is organized as follows: Section 2 discusses background studies and related works on normalizing noisy words. Section 3 presents the theoretical framework and methodology to improve on the limitations faced by the previous works [2]. Section 4 presents the implementation of the overall normalisation process using the experimental dataset. ...
... One of the recent works by Samsudin et al. investigated the pattern of abbreviations most used by Malaysians on social media and to create an artificial abbreviation list in order to improve the normalisation process of noisy texts [2]. They have generated a list of abbreviations following a set of rules that encompasses the way Malaysians write noisy terms. ...
... Other than referencing the spelling corrector's algorithm, this project also incorporated the Artificial Abbreviation list rules that was brought forward by Samsudin et al. [2]. However, not all of the rules were applied, seeing as there were rules that manipulated syllables. ...
Article
Full-text available
Users interact using short-formed words and abbreviations and this results in a message full of noisy words that are not recognized by the system's knowledge. The aim of this research is to overcome the limitations that still bar the progression of normalizing Malay noisy words from social media platforms. The testing data gathered is 25,000; 15,000 Tweets from Twitter and 10,000 comments from Facebook respectively. Pre-processing steps were carried out to clean the entire dataset which consists of unique 179,786 words. 36,587 out-of-vocabulary (OOV) Malay terms were then extracted and checked against an in- vocabulary (IV) Malay corpus using the Levenshtein edit distance formula and character manipulation rules. The resultant output is 3,964 unique IV Malay words. Based on the results, the usage of edit distance and rules can be further improved to elevate the normalisation of the ever changing colloquial terms of the Malay language.
... Categorisations appear to be cherry-picked in light of the solution proposed. Misspelled words [13], out-of-vocabulary words [14], ill-formed words and noisy text [14,15,17] were used to describe the unconventional condition of Malay social media text. Basri et al. [13] proposed a framework for an automatic spell-checker and corrector for misspelled words. ...
... Categorisations appear to be cherry-picked in light of the solution proposed. Misspelled words [13], out-of-vocabulary words [14], ill-formed words and noisy text [14,15,17] were used to describe the unconventional condition of Malay social media text. Basri et al. [13] proposed a framework for an automatic spell-checker and corrector for misspelled words. ...
... Basri et al. also handled the universal character "x" which indicates a negation and twicely duplicated words. Samsudin et al. [14] constructed a set of rules capable of automatically-generating artificial noisy text. These rules were based on an earlier work by DBP as an effort to streamline Short Message Service (SMS) texts [16]. ...
Article
Full-text available
span>In this paper, we proposed a preliminary taxonomy of Malay social media text. Performing text analytics on Malay social media text is a challenge. The formal Malay language follows specific spelling and sentence construction rules. However, the Malay language used in social media differs in both aspects. This impedes the accuracy of text analytics. Due to the complexity of Malay social media text, many researches has chosen to focus on classifying the formal Malay language. To the best of our knowledge, we are the first to propose a formal taxonomy for Malay text in social media. Narrow and informal categorisations of Malay social media text can be found amidst efforts to pre-process social media text, yet cherry-picked only some categories to be handled. We have differentiated Malay social media text from the formal Malay language by identifying them as Social Media Malay Language or SMML. They consists of spelling variations , Malay-English mix sentence , Malay-spelling English words , slang-based words, vowel-les words, number suffixes and manner of expression. This taxonomy is expected to serve as a guideline in research and commercial products.</span
... Narrow categorizations were found described within works to normalize the text or efforts to check and correct spelling errors automatically. Malay social media text has been labeled as misspelled words [14], out-of-vocabulary words [14], [15] ill-formed words, and noisy text [16]. A shared assumption of these works is that of the availability of a dictionary of standard Malay spellings to replace these "rogue" text. ...
... Samsudin et al. [15] constructed a set of rules capable of automatically generating artificial noisy text. These rules were based on an earlier work by DBP [19], which was also authorized to produce a guideline on how Short Message Service (SMS) text should be used in official correspondence as well as on TV channels in Malaysia. ...
... The data originates from users' activities, as they are mostly online to share real-time data, including happening events or trending topics (Matuszka et al., 2013). Due to the increase in the usage of social media, traditional media have been used less in recent times (Himelboim, Mccreery, & Smith, 2013) and have a lesser impact during a disaster compared to the social media (Matar et al., 2016; Tengku et al., 2015). Previous research proved that social media is important for information dissemination among Malaysians (Samsudin, Puteh, Hamdan, & Nazri, 2013) and the general public, giving room for better integration with official disaster response (Sutton et al., 2008). Information dissemination on the social media is helpful in various ways. ...
Article
Full-text available
This article is based on a study which examined the information dissemination process on the social media during the Malaysia 2014 floods by employing the Social Network Analysis. Specifically, the study analyzed the type of network structure formed and its density, the influential people involved, and the kind of information shared during the flood. The data was collected from a non-governmental organization fan page (NGOFP) and a significant civilian fan page (ICFP) on Facebook using NodeXL. The two datasets contained 296 posts which generated different network structures based on the state of the flood, information available, and the needs of the information. Through content analysis, five common themes emerged from the information exchanges for both fan pages which helped in providing material and psychological support to the flood victims. However, only 5% of the networks’ population served as information providers, and this prompted the need for more active participation especially from organizations with certified information. Based on the findings presented and elaborated, this article concluded by stating the implications and recommendations of the study conducted.
... Written comments consume more time in interpretation compared to objective ratings. According to Samsudin, Puteh, Hamdan and Ahmad Nazri (2013), noisy text is a common phenomenon in online reviews and it affects data mining exercises. Also, comments may be irrelevant or casual (Zhang et al., 2013). ...
Article
Full-text available
Ratings and comments play a dominant role in online reviews. The question, thus, arises as to whether or not there is any consistency in consumer perception of the reviews, and how future choices might be influenced. We analysed 2000 comments of 20 different hotels posted on TripAdvisor to determine if the comments posted by previous guests of a hotel influence the decisions of potential guests. Two hundred human raters were asked to consider 20 reviews and to rate a hotel based on the reviews. The Cohen Kappa coefficient was used to evaluate the degree of agreement on the hotel quality as determined by the human raters and the star rating given by the original reviewer. The results showed a high consistency between the human raters’ evaluation and the reviewers’ star rating. This research reveals the importance of website feedback such as TripAdvisor in influencing consumer choice.
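The agreement measure used in the study above, Cohen's kappa, compares observed agreement between two raters against the agreement expected by chance. A direct computation can be sketched as follows; the two rating sequences are invented for illustration.

```python
# Cohen's kappa between two raters; the label data here is invented.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two equal-length label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters labelling ten reviews as positive/negative.
a = ["pos", "pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos", "pos"]
b = ["pos", "pos", "neg", "pos", "neg", "neg", "neg", "neg", "pos", "pos"]
print(round(cohen_kappa(a, b), 3))
```

A kappa near 1 indicates the high consistency between human raters and reviewer star ratings that the study reports.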
... However, using questionnaires and interviews is no longer suitable for sentiment analysis, since most people tend to express their opinions, emotions, satisfaction and dissatisfaction via the social media. This makes text and data mining an important task in analysing the large amounts of text and data on social media servers and data warehouses (Chayanukro, Mahmuddin, & Husni, 2014; Samsudin, Puteh, & Hamdan, 2013). One of the essential techniques in this area is sentiment analysis, which is used to analyse people's emotions and sentiments and has spread widely across many countries and languages. ...
Article
In the last decade, social media usage has increased exponentially in Thailand. A huge number of Thai online reviews and comments are posted on social networks every second. Because of this, comment analysis, also called sentiment analysis, has become an essential task for analysing people's emotions, opinions, attitudes and sentiments from these online posts. This paper proposes a technique for analysing Thai customers' comments or opinions about products and services by counting the polarity words of the product and service domains. To demonstrate the proposed technique, experimental studies on analysing Thai customers' comments on social media are presented in this paper. The comments are classified as neutral, positive or negative. The proposed technique benefits the business domain by guiding product improvement and quality of service. Hence, this paper also benefits end-users in making smart decisions.
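The polarity-word counting approach described in the abstract above can be sketched in a few lines. The lexicon entries here are invented English placeholders; the cited work counts Thai polarity words for the product and service domains.

```python
# A minimal polarity-word counting classifier; the lexicon is invented.
POSITIVE = {"good", "great", "excellent", "fast", "friendly"}
NEGATIVE = {"bad", "slow", "rude", "broken", "poor"}

def classify(comment):
    """Label a comment positive, negative or neutral by counting polarity words."""
    tokens = comment.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("great food but slow rude service"))
```

The comment is labelled by the sign of the net polarity count, with a zero score mapping to neutral.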
... Recent years have witnessed an increase in social tension contagion events on social media [1], [2]. A number of sentiment analysis studies on social tension detection have been conducted [26], [27], [28], [1], [2]. These studies have proven that public emotion can be analyzed from social media. ...
... In [14], Malay newspaper sentences are classified based on the artificial immune concept called the negative selection algorithm, while [15] utilizes a series of Malay stemming algorithms, namely the reverse Porter algorithm, the backward-forward algorithm and the immune network algorithm, for sentiment classification of Malay newspaper articles. In [16], noise and its impact on the sentiment classification of Malay movie reviews are investigated. A more recent work by [17] investigates the effects of noise removal and stemming using a series of supervised classifiers. ...
Conference Paper
Full-text available
Vital to the task of mining sentiment from text is a sentiment lexicon, or a dictionary of terms annotated for their a priori information across the semantic dimension of sentiment. Each term is assigned a general, out-of-context sentiment polarity. Unfortunately, online dictionaries and similar lexical resources do not readily include information on the sentiment properties of their entries. Moreover, manually compiling sentiment lexicons is tedious in terms of annotator time and effort. This has resulted in the emergence of a large volume of research concentrated on automated sentiment lexicon generation algorithms. Most of these algorithms were designed for English, attributable to the abundance of readily available lexical resources in this language. This is not the case for low-resource languages such as the Malay language. Although there has been an exponential increase in research on Malay sentiment analysis over the past few years, the subtask of sentiment lexicon induction for this particular language remains under-investigated. We present a minimally-supervised sentiment lexicon induction model specifically designed for the Malay language. It takes as input only two initial paradigm positive and negative terms, and mines WordNet Bahasa's synonym chains and Kamus Dewan's gloss information to extract subjective, sentiment-laden terms. The model automatically bootstraps a reliable, high coverage sentiment lexicon that can be employed in Malay sentiment analysis on full-text. Intrinsic evaluation of the model against a manually annotated test set demonstrates that its ability to assign sentiment properties to terms is on par with human judgement.
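The bootstrapping idea above, propagating the polarity of two paradigm seeds along synonym chains, can be sketched over a toy synonym graph. The graph below is invented; the cited model mines WordNet Bahasa and Kamus Dewan glosses instead.

```python
# A toy sketch of seed-based polarity propagation along synonym chains.
# The synonym graph is invented for illustration.
from collections import deque

SYNONYMS = {
    "baik": ["bagus", "elok"],
    "bagus": ["hebat"],
    "buruk": ["teruk"],
    "teruk": ["jahat"],
}

def bootstrap(seed, polarity, lexicon):
    """Propagate a seed's polarity breadth-first along synonym links."""
    queue = deque([seed])
    while queue:
        term = queue.popleft()
        if term in lexicon:        # already labelled by an earlier seed
            continue
        lexicon[term] = polarity
        queue.extend(SYNONYMS.get(term, []))
    return lexicon

lexicon = {}
bootstrap("baik", "positive", lexicon)   # positive paradigm seed
bootstrap("buruk", "negative", lexicon)  # negative paradigm seed
print(sorted(lexicon.items()))
```

Starting from only two seeds, the traversal labels every term reachable through the synonym chains, yielding a small bootstrapped lexicon.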
Article
Full-text available
Advancement in information and technology facilities, especially the Internet, has changed the way we communicate and express opinions or sentiments on services or products that we consume. Opinion mining aims to automate the process of classifying opinions into positive or negative views. It will benefit both the customers and the sellers in identifying the best product or service. Although there are researchers who explore new techniques for identifying sentiment polarization, few works have addressed opinion mining of reviews created by Malaysian reviewers. The same scenario applies to micro-text. Therefore, in this study, we conduct an exploratory research on opinion mining of online movie reviews collected from several forums and blogs written by Malaysians. The experiment data are tested using machine learning classifiers, i.e. Support Vector Machine, Naïve Bayes and k-Nearest Neighbour. The result illustrates that the performance of these machine learning techniques without any preprocessing of the micro-texts or feature selection is quite low. Therefore, additional steps are required in order to mine the opinions from these data.
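Of the three classifiers named above, Naïve Bayes is the simplest to sketch from scratch. The multinomial variant below uses invented English training reviews as stand-ins for the Malaysian movie reviews in the study.

```python
# A tiny multinomial Naive Bayes text classifier with Laplace smoothing.
# The training reviews are invented placeholders.
from collections import Counter
from math import log

TRAIN = [
    ("best movie ever truly enjoyable", "pos"),
    ("great acting and great story", "pos"),
    ("boring plot and bad acting", "neg"),
    ("waste of time very boring", "neg"),
]

def train(data):
    words = {label: Counter() for _, label in data}
    docs = Counter(label for _, label in data)
    for text, label in data:
        words[label].update(text.split())
    vocab = {w for counts in words.values() for w in counts}
    return words, docs, vocab

def predict(text, model):
    words, docs, vocab = model
    def score(label):
        total = sum(words[label].values())
        s = log(docs[label] / sum(docs.values()))  # class prior
        for w in text.split():
            # Laplace smoothing over the shared vocabulary.
            s += log((words[label][w] + 1) / (total + len(vocab)))
        return s
    return max(docs, key=score)

model = train(TRAIN)
print(predict("great story", model))
```

Each class is scored by its log prior plus the smoothed log likelihood of the review's tokens, and the highest-scoring class wins.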
Conference Paper
Full-text available
Noise in textual data such as that introduced by multi-linguality, misspellings, abbreviations, deletions, phonetic spellings, non-standard transliteration, etc. poses considerable problems for text mining. Such corruptions are very common in instant messenger (IM) and short message service (SMS) data and adversely affect off-the-shelf text mining methods. Most techniques address this problem with supervised methods. But they require labels that are very expensive and time consuming to obtain. While we do not champion unsupervised methods over supervised when quality of results is the supreme and singular concern, we demonstrate that unsupervised methods can provide cost effective results without the need for expensive human intervention to generate parallel labelled corpora. A generative model based unsupervised technique is presented that maps non-standard words to their corresponding conventional frequent form. A Hidden Markov Model (HMM) over subsequencized representation of words is used subject to a parameterization such that the training phase involves clustering over vectors and not the customary dynamic programming over sequences. A principled transformation of maximum likelihood based "central clustering" cost function into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods that model delete and insert edit operations well. The novelty of this approach lies in that the expensive (Baum-Welch) iterations required for HMM can be avoided through a careful factorization of the HMM log-likelihood and in establishing the connection between the information theoretic cost function and the kernel approach of machine learning. Anecdotal evidence of efficacy is provided on public and proprietary data.
Conference Paper
Full-text available
Electronic written texts used in computer-mediated interactions (e-mails, blogs, chats, etc) present major deviations from the norm of the language. This paper presents a comparative study of systems aiming at normalizing the orthography of French SMS messages: after discussing the linguistic peculiarities of these messages, and possible approaches to their automatic normalization, we present, evaluate and contrast two systems, one drawing inspiration from the Machine Translation task, the other using techniques that are commonly used in automatic speech recognition devices. Combining both approaches, our best normalization system achieves about 11% Word Error Rate on a test set of about 3000 unseen messages.
Conference Paper
Full-text available
Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS based Question Answering (QA) services. However automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based question answering system for SMS users. We handle the noise in a SMS query by formulating the query similarity over FAQ questions as a combinatorial search problem. The search space consists of combinations of all possible dictionary variations of tokens in the noisy query. We present an efficient search algorithm that does not require any training data or SMS normalization and can handle semantic variations in question formulation. We demonstrate the effectiveness of our approach on two real-life datasets.
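The combinatorial search space described above, where each noisy token expands into its dictionary variations, can be illustrated by brute-force enumeration. The variation table below is an invented toy; the cited system derives variations from a dictionary and searches the space efficiently rather than enumerating it.

```python
# Enumerating the candidate space of per-token dictionary variations.
# The variation table is invented for illustration.
from itertools import product

VARIATIONS = {
    "hw": ["how", "hardware"],
    "2": ["to", "two", "too"],
    "chk": ["check", "chick"],
}

def candidate_queries(noisy_tokens):
    """Enumerate all combinations of per-token dictionary variations."""
    options = [VARIATIONS.get(t, [t]) for t in noisy_tokens]
    return [" ".join(combo) for combo in product(*options)]

queries = candidate_queries(["hw", "2", "chk", "balance"])
print(len(queries))  # 2 * 3 * 2 * 1 combinations
```

Even this four-token query yields a dozen candidates, which is why an efficient search over the space, rather than exhaustive enumeration, matters for longer queries.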
Article
Full-text available
Language usage over computer mediated discourses, like chats, emails and SMS texts, significantly differs from the standard form of the language. An urge towards shorter message length facilitating faster typing and the need for semantic clarity shape the structure of this non-standard form known as the texting language. In this work we formally investigate the nature and type of compressions used in SMS texts, and based on the findings develop a word level model for the texting language. For every word in the standard language, we construct a Hidden Markov Model that succinctly represents all possible variations of that word in the texting language along with their associated observation probabilities. The structure of the HMM is novel and arrived at through linguistic analysis of the SMS data. The model parameters have been estimated from a word-aligned SMS and standard English parallel corpus, through machine learning techniques. Preliminary evaluation shows that the word-model can be used for decoding texting language words to their standard counterparts with more than 80% accuracy.
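The decoding idea above, scoring each standard word by the probability that its character-level model generates the observed texting form, can be caricatured with a much simpler generative model. The keep/drop-vowel probabilities and the vocabulary below are invented; the cited work uses a trained HMM per word, not this toy.

```python
# A toy generative word model: each character of the standard word is either
# kept or (if a vowel) dropped; decode by maximum likelihood over a vocabulary.
# Probabilities and vocabulary are invented for illustration.
P_KEEP, P_DROP_VOWEL = 0.95, 0.7

def likelihood(standard, texted):
    """P(texted | standard) under the keep/drop-vowel character model."""
    def walk(i, j):
        if j == len(texted):
            # Remaining standard characters must all be droppable vowels.
            rest = standard[i:]
            return P_DROP_VOWEL ** len(rest) if all(c in "aeiou" for c in rest) else 0.0
        if i == len(standard):
            return 0.0
        keep = P_KEEP * walk(i + 1, j + 1) if standard[i] == texted[j] else 0.0
        drop = P_DROP_VOWEL * walk(i + 1, j) if standard[i] in "aeiou" else 0.0
        return keep + drop
    return walk(0, 0)

VOCAB = ["makan", "mahal", "minum"]
print(max(VOCAB, key=lambda w: likelihood(w, "mkn")))
```

The noisy form "mkn" is decoded to "makan" because only that word can generate it by dropping vowels alone.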
Article
Full-text available
We present a robust parser which is trained on a treebank of ungrammatical sentences. The treebank is created automatically by modifying Penn treebank sentences so that they contain one or more syntactic errors. We evaluate an existing Penn-treebank-trained parser on the ungrammatical treebank to see how it reacts to noise in the form of grammatical errors. We re-train this parser on the training section of the ungrammatical treebank, leading to a significantly improved performance on the ungrammatical test sets. We show how a classifier can be used to prevent performance degradation on the original grammatical data.
Article
Cell phone text messaging users express themselves briefly and colloquially using a variety of creative forms. We analyze a sample of creative, non-standard text message word forms to determine frequent word formation processes in texting language. Drawing on these observations, we construct an unsupervised noisy-channel model for text message normalization. On a test set of 303 text message forms that differ from their standard form, our model achieves 59% accuracy, which is on par with the best supervised results reported on this dataset.
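The noisy-channel formulation above selects the standard word w maximizing P(w) · P(t | w) for an observed texting form t. A toy version with hand-invented probability tables illustrates the argmax; the cited model estimates the channel from observed word-formation processes instead.

```python
# A toy noisy-channel normalizer: argmax over P(w) * P(t | w).
# Both probability tables are invented for illustration.
LANGUAGE_MODEL = {"see": 0.6, "sea": 0.3, "cee": 0.1}  # P(w)
CHANNEL = {                                             # P(t | w)
    ("c", "see"): 0.5,
    ("c", "sea"): 0.2,
    ("c", "cee"): 0.9,
}

def normalize(texted):
    """Return the standard word maximizing P(w) * P(texted | w)."""
    return max(LANGUAGE_MODEL,
               key=lambda w: LANGUAGE_MODEL[w] * CHANNEL.get((texted, w), 0.0))

print(normalize("c"))
```

Although "cee" has the highest channel probability for "c", the language-model prior tips the decision to the frequent word "see", which is exactly the trade-off the noisy-channel model encodes.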
Conference Paper
Addressed in this paper is the issue of 'email data cleaning' for text mining. Many text mining applications need to take emails as input. Email data is usually noisy and thus it is necessary to clean up email data before conducting mining. Although several products offer email cleaning features, the types of noises that can be processed are limited. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation of the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. Thus, it is made independent of any specific mining process. A cascaded approach is proposed, which cleans up an email in four passes including non-text filtering, paragraph normalization, sentence normalization, and word normalization. To the best of our knowledge, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing the tasks on the basis of Support Vector Machines (SVMs) have been proposed in this paper. Features used in the models have also been defined. Experimental results indicate that the proposed SVM based methods for email cleaning can significantly outperform the baseline methods. The proposed method has also been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction can be significantly improved after applying the email data cleaning method proposed in this paper.
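The four-pass cascade described above can be caricatured with simple heuristics. The regex rules below are invented stand-ins for the SVM-based passes in the paper, intended only to show the shape of the pipeline.

```python
# A toy four-pass email cleaning cascade; the heuristics are invented
# stand-ins for the paper's SVM-based passes.
import re

def clean_email(raw):
    lines = raw.splitlines()
    # Pass 1: non-text filtering - drop quoted replies and signature markers.
    lines = [l for l in lines if not l.startswith(">") and l.strip() != "--"]
    text = " ".join(lines)
    # Pass 2: paragraph normalization - collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Pass 3: sentence normalization - ensure a space after terminators.
    text = re.sub(r"([.!?])(\w)", r"\1 \2", text)
    # Pass 4: word normalization - lower-case everything.
    return text.lower()

raw = "Hello!Meeting at 3pm.\n> quoted reply\n--\nBob"
print(clean_email(raw))
```

Each pass consumes the output of the previous one, which is what makes the approach independent of any particular downstream mining task.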
Conference Paper
Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from the customization problem, as tremendous effort is required to adapt the language model of the existing translation system to handle SMS text style. We offer an alternative approach to resolve such irregularities by normalizing SMS texts before MT. In this paper, we view the task of SMS normalization as a translation problem from the SMS language to the English language, and we propose to adapt a phrase-based statistical MT model for the task. Evaluation by 5-fold cross validation on a parallel SMS normalized corpus of 5000 sentences shows that our method can achieve 0.80702 in BLEU score against the baseline BLEU score of 0.6958. Another experiment of translating SMS texts from English to Chinese on a separate SMS text corpus shows that using SMS normalization as MT preprocessing can largely boost SMS translation performance from 0.1926 to 0.3770 in BLEU score.
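Phrase-based normalization, treating SMS-to-English as translation with a phrase table, can be sketched with a greedy longest-match decoder. The phrase table below is an invented toy; the cited system learns phrase pairs and their probabilities from a parallel corpus with a statistical MT model.

```python
# A toy phrase-based SMS normalizer: greedily replace the longest matching
# SMS phrase with its standard-English phrase. The table is invented.
PHRASE_TABLE = {
    "c u": "see you",
    "2nite": "tonight",
    "gr8": "great",
}

def normalize_sms(text):
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # Try the longest phrase first (here: up to 2 tokens).
        for span in (2, 1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in PHRASE_TABLE:
                out.append(PHRASE_TABLE[phrase])
                i += span
                break
        else:
            out.append(tokens[i])  # out-of-table token passes through
            i += 1
    return " ".join(out)

print(normalize_sms("c u 2nite gr8"))
```

A real phrase-based model would score competing segmentations with translation and language-model probabilities rather than taking the first greedy match.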
Conference Paper
Text mining aims at deriving high quality information from text in an automated way. Text mining applications rely on Natural Language Processing (NLP) tools like taggers, parsers etc. to locate and retrieve relevant information in an application specific manner. Most of these NLP tools, however, have been designed to work on clean and grammatically correct text. Presently, many organizations are interested in deriving information from informally written text that is generated as a result of human communication through emails, blog posts, web-based reviews etc. These texts are highly noisy and often contain a mixture of languages. In this study we present some analysis of how noise introduced by incorrect English affects the performance of some of the NLP tools and thereafter the text mining applications. The text mining application that we focus on is opinion mining. Opinion mining is the most significant text mining application that has to deal with noisy text generated in an unregulated fashion by users.