Preprocessing Techniques for Text Mining
Dr. S. Kannan, Associate Professor, Department of Computer Applications, Madurai Kamaraj University. skannanmku@gmail.com
Vairaprakash Gurusamy, Research Scholar, Department of Computer Applications, Madurai Kamaraj University. vairaprakashmca@gmail.com
Abstract
Preprocessing is an important task and a critical step in text mining, Natural Language Processing (NLP) and Information Retrieval (IR). In text mining, data preprocessing is used to extract interesting, non-trivial knowledge from unstructured text data. Information retrieval is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. That need is represented by a query or profile, which contains one or more search terms plus some additional information such as the weights of the words. The retrieval decision is therefore made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query. Unfortunately, the words that appear in documents and in queries often have many structural variants. So, before information is retrieved from the documents, data preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system. The objective of this study is to analyze the issues of preprocessing methods such as tokenization, stop word removal and stemming for text documents.
Keywords: Text Mining, NLP, IR, Stemming
I. Introduction
Text pre-processing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages, from analysis and tagging components, such as morphological analyzers and part-of-speech taggers, through applications, such as information retrieval and machine translation systems. It is a collection of activities in which text documents are pre-processed. Pre-processing is needed because text data often contains special formats, such as number and date formats, and because the most common words that are unlikely to help text mining, such as prepositions, articles, and pronouns, can be eliminated.
Need of Text Preprocessing in NLP System
1. To reduce the indexing (or data) file size of the text documents
   i) Stop words account for 20-30% of the total word count in a particular text document
   ii) Stemming may reduce the index size by as much as 40-50%
2. To improve the efficiency and effectiveness of the IR system
   i) Stop words are not useful for searching or text mining and they may confuse the retrieval system
   ii) Stemming is used for matching similar words in a text document
II. Tokenization
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is the exploration of the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. Textual data is only a block of characters at the beginning. All processes in information retrieval require the words of the data set. Hence, a parser requires the documents to be tokenized first. This may sound trivial as the text is already stored in machine-readable formats. Nevertheless, some problems are still left, like the removal of punctuation marks. Other characters like brackets, hyphens, etc. require processing as well. The main use of tokenization is identifying the meaningful keywords. Furthermore, a tokenizer can enforce consistency in the documents; typical inconsistencies are differing number and time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard form.
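As a rough illustration of the tokenization step described above, the following Python sketch splits text into word tokens while dropping punctuation and brackets. The token pattern, the lower-casing step and the function name `tokenize` are assumptions made for this example, not a method prescribed by the paper.

```python
import re

# Hypothetical token pattern: a run of letters or digits, optionally joined
# by internal hyphens or apostrophes (e.g. "part-of-speech", "user's").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lower-cased word tokens, dropping punctuation and brackets."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


print(tokenize("The user's query (e.g. part-of-speech tags) is split."))
```

Note that the abbreviation "e.g." is broken into the two tokens "e" and "g", which illustrates why abbreviations and acronyms need to be mapped to a standard form before or during tokenization.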
Challenges in Tokenization
Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of the words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenizing unsegmented language sentences requires additional lexical and morphological information. Tokenization is also affected by the writing system and the typographical structure of the words. The structure of languages can be grouped into three categories:
Isolating: Words do not divide into smaller units. Example: Mandarin Chinese.
Agglutinative: Words divide into smaller units. Examples: Japanese, Tamil.
Inflectional: Boundaries between morphemes are not clear and are ambiguous in terms of grammatical meaning. Example: Latin.
III. Stop Word Removal
Many words in documents recur very frequently but are essentially meaningless, as they are used to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents. Due to their high frequency of occurrence, their presence in text mining presents an obstacle to understanding the content of the documents.
Stop words are very frequently used common words like ‘and’, ‘are’, ‘this’, etc. They are not useful in the classification of documents, so they must be removed. However, the development of such a stop word list is difficult, and the lists are inconsistent between textual sources. Removing stop words also reduces the text data and improves system performance. Every text document contains these words, which are not necessary for text mining applications.
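Stop word removal can be sketched as a simple filter over a token list. The stop list below is a tiny hypothetical one chosen for this example; as noted above, real lists are longer and vary between sources.

```python
# Hypothetical, deliberately tiny stop list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving their order."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["this", "is", "an", "example", "of", "stop", "word", "removal"]
kept = remove_stop_words(tokens)
print(kept)
# Fraction of tokens removed from this toy sentence.
print(1 - len(kept) / len(tokens))
```

The set gives constant-time membership tests, so the filter runs in time proportional to the number of tokens.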
IV. Stemming
Stemming is the process of conflating the variant forms of a word into a common representation, the stem. For example, the words “presentation”, “presented” and “presenting” could all be reduced to the common representation “present”. This is a widely used procedure in text processing for information retrieval (IR), based on the assumption that posing a query with the term “presenting” implies an interest in documents containing the words “presentation” and “presented”.
Errors in Stemming
There are mainly two kinds of error in stemming:
1. Over-stemming
2. Under-stemming
Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive.
Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative.
Types of Stemming Algorithms
i) Table Look Up Approach
One method of stemming is to store a table of all index terms and their stems. Terms from the queries and indexes could then be stemmed via table lookup, using B-trees or hash tables. Such lookups are very fast, but there are problems with this approach. First, no such data exists for English, and even if it did, many terms would not be represented because they are domain specific and would require some other stemming method. The second issue is storage overhead.
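A minimal sketch of the table lookup approach, using a Python dictionary (a hash table) as the term-to-stem table. The table contents and the fallback of returning the term unchanged are assumptions for illustration only.

```python
# Hypothetical hand-built table mapping index terms to their stems.
STEM_TABLE = {
    "presentation": "present",
    "presented": "present",
    "presenting": "present",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the term unchanged if it is not in the table."""
    return table.get(term.lower(), term)


print(lookup_stem("Presented"))
print(lookup_stem("tokenizer"))
```

The second call shows the coverage problem mentioned above: a term missing from the table passes through unstemmed, so some other method is still needed for it.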
ii) Successor Variety
Successor variety stemmers are based on work in structural linguistics, which determines word and morpheme boundaries from the distribution of phonemes. The successor variety of a string is the number of distinct characters that follow it in words in some body of text. For example, consider a body of text consisting of the following words:
able, ape, beatable, finable, read, readable, reading, reads, red, rope, ripe.
Let’s determine the successor varieties for the word “read”. The first letter in “read” is R. R is followed in the text body by 3 characters, E, I and O, so the successor variety of R is 3. The next successor variety for “read” is 2, since A and D follow RE in the text body, and so on. The following table shows the complete successor varieties for the word “read”.
Prefix   Successor Variety   Letters
R        3                   E, I, O
RE       2                   A, D
REA      1                   D
READ     3                   A, I, S
Table 1.1 Successor variety for the word read
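The counts in Table 1.1 can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes that the end of a word is not counted as a successor.

```python
CORPUS = ["able", "ape", "beatable", "finable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]


def successors(prefix, corpus):
    """Return the set of distinct letters that follow `prefix` in the corpus words."""
    n = len(prefix)
    return {w[n] for w in corpus if len(w) > n and w.startswith(prefix)}


def successor_table(word, corpus):
    """List (prefix, successor variety, sorted letters) for every prefix of `word`."""
    rows = []
    for i in range(1, len(word) + 1):
        letters = sorted(successors(word[:i], corpus))
        rows.append((word[:i], len(letters), letters))
    return rows


for prefix, variety, letters in successor_table("read", CORPUS):
    print(prefix.upper(), variety, ",".join(letter.upper() for letter in letters))
```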
Once the successor varieties for a given word have been determined, this information is used to segment the word. Hafer and Weiss discussed the following ways of doing this:
1. Cutoff method: Some cutoff value is selected and a boundary is identified whenever the cutoff value is reached.
2. Peak and plateau method: A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
3. Complete word method: A break is made after a segment if the segment is a complete word in the corpus.
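The three segmentation methods can be sketched as follows, using the same example corpus. This is one possible reading of the descriptions above rather than Hafer and Weiss's exact procedure: the function names are hypothetical, the end of a word is not counted as a successor, and the complete word method is assumed to test each proper prefix of the word.

```python
CORPUS = ["able", "ape", "beatable", "finable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]


def successor_variety(prefix, corpus):
    """Count the distinct letters that follow `prefix` in the corpus words."""
    n = len(prefix)
    return len({w[n] for w in corpus if len(w) > n and w.startswith(prefix)})


def varieties(word, corpus):
    """Successor variety of every prefix of `word`; index i is the prefix of length i + 1."""
    return [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]


def split_at(word, cuts):
    """Split `word` after each prefix length listed in `cuts`."""
    bounds = [0] + sorted(cuts) + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def cutoff_segments(word, corpus, cutoff):
    """Cutoff method: break wherever the successor variety reaches `cutoff`."""
    sv = varieties(word, corpus)
    return split_at(word, [i + 1 for i, v in enumerate(sv[:-1]) if v >= cutoff])


def peak_plateau_segments(word, corpus):
    """Peak and plateau method: break after a prefix whose variety exceeds both neighbours."""
    sv = varieties(word, corpus)
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    return split_at(word, cuts)


def complete_word_segments(word, corpus):
    """Complete word method: break after any proper prefix that is itself a corpus word."""
    words = set(corpus)
    return split_at(word, [i for i in range(1, len(word)) if word[:i] in words])


print(cutoff_segments("readable", CORPUS, cutoff=3))
print(peak_plateau_segments("readable", CORPUS))
print(complete_word_segments("readable", CORPUS))
```

On this small corpus the cutoff method with a cutoff of 3 also breaks after the first letter, which suggests how sensitive that method is to the chosen value, while the other two methods split "readable" into "read" and "able".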
iii) N-Gram Stemmers
This method was designed by Adamson and Boreham. It is called the shared digram method. A digram is a pair of consecutive letters. The method is also called the n-gram method, since trigrams or, in general, n-grams could be used instead. In this method, association measures are calculated between pairs of terms based on their shared unique digrams.
For example, consider the two words “stemming” and “stemmer”:
stemming: st te em mm mi in ng
stemmer: st te em mm me er
In this example the word “stemming” has 7 unique digrams, “stemmer” has 6 unique digrams, and the two words share 4 unique digrams: st, te, em and mm. Once the number of unique digrams is found, a similarity measure based on the unique digrams is calculated using the Dice coefficient, which is defined as
S = 2C / (A + B)
where C is the number of unique digrams common to both words, A is the number of unique digrams in the first word, and B is the number of unique digrams in the second word. For the example above, S = 2(4) / (7 + 6) = 8/13 ≈ 0.62. Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
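The shared digram computation and the Dice coefficient can be sketched in Python as follows. The function names and the small term list are assumptions for illustration; the clustering step is not shown.

```python
from itertools import combinations


def digrams(word):
    """Return the set of unique pairs of consecutive letters in `word`."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(first, second):
    """Dice coefficient S = 2C / (A + B) over the unique digrams of two words."""
    a, b = digrams(first), digrams(second)
    if not a and not b:
        return 0.0  # assumption: words too short to have digrams get similarity 0
    return 2 * len(a & b) / (len(a) + len(b))


a, b = digrams("stemming"), digrams("stemmer")
print(len(a), len(b), sorted(a & b))
print(round(dice("stemming", "stemmer"), 3))

# Pairwise similarities for a small hypothetical term list; these values
# would populate the similarity matrix that is later clustered.
terms = ["stemming", "stemmer", "reading"]
similarity = {(s, t): dice(s, t) for s, t in combinations(terms, 2)}
print(similarity)
```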
iv) Affix Removal Stemmers
Affix removal stemmers remove suffixes or prefixes from terms, leaving the stem. One example of an affix removal stemmer is one that removes the plural forms of terms. A set of rules for such a stemmer is as follows (Harman):
a) If a word ends in “ies” but not “eies” or “aies”, then “ies” -> “y”
b) If a word ends in “es” but not “aes”, “ees” or “oes”, then “es” -> “e”
c) If a word ends in “s” but not “us” or “ss”, then “s” -> NULL
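The three plural-removal rules above can be sketched in Python as follows. The sketch assumes that the rules are tried in order, that only the first rule that applies is used, and that a rule applies only when both its suffix test and its exception test pass; the function name `s_stem` is hypothetical.

```python
def s_stem(word):
    """Apply the first matching plural-removal rule (a, b, c) to `word`."""
    w = word.lower()
    # Rule a: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule b: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule c: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for term in ["queries", "pages", "cats", "class", "corpus"]:
    print(term, "->", s_stem(term))
```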
V. Conclusion
In this work we have presented efficient preprocessing techniques. These pre-processing techniques eliminate noise from text data, then identify the root word for actual words and reduce the size of the text data. This improves the performance of the IR system.