Preprocessing Techniques for Text Mining

Dr. S. Kannan
Associate Professor
Department of Computer Applications
Madurai Kamaraj University
skannanmku@gmail.com

Vairaprakash Gurusamy
Research Scholar
Department of Computer Applications
Madurai Kamaraj University
vairaprakashmca@gmail.com
Abstract

Preprocessing is an important and critical step in text mining, Natural Language Processing (NLP), and Information Retrieval (IR). In text mining, data preprocessing is used to extract interesting, non-trivial knowledge from unstructured text data. Information retrieval is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. That need is represented by a query or profile containing one or more search terms, plus some additional information such as the weights of the words. The retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query. Unfortunately, the words that appear in documents and in queries often have many structural variants. So, before information is retrieved from the documents, preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system. The objective of this study is to analyze issues in the preprocessing methods of tokenization, stop word removal, and stemming for text documents.

Keywords: Text Mining, NLP, IR, Stemming
I. Introduction

Text preprocessing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages, from analysis and tagging components, such as morphological analyzers and part-of-speech taggers, through applications, such as information retrieval and machine translation systems. It is a collection of activities in which text documents are pre-processed. Text data often contains special formats, such as number and date formats, as well as very common words that are unlikely to help text mining, such as prepositions, articles, and pronouns, which can be eliminated.
Need for Text Preprocessing in an NLP System

1. To reduce the indexing (or data) file size of the text documents:
   i) Stop words account for 20-30% of the total word count in a typical text document.
   ii) Stemming may reduce the indexing size by as much as 40-50%.
2. To improve the efficiency and effectiveness of the IR system:
   i) Stop words are not useful for searching or text mining, and they may confuse the retrieval system.
   ii) Stemming is used to match similar words within a text document.
II. Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is to identify the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. Textual data is, at the beginning, only a block of characters, while all processes in information retrieval require the words of the data set; hence a parser first requires the tokenization of documents. This may sound trivial, as the text is already stored in machine-readable formats. Nevertheless, some problems remain, such as the removal of punctuation marks; other characters, like brackets and hyphens, require processing as well. Furthermore, a tokenizer can enforce consistency in the documents: inconsistencies arise from differing number and time formats, and abbreviations and acronyms have to be transformed into a standard form. The main use of tokenization is identifying meaningful keywords.
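To make the idea concrete, here is a minimal Python sketch of a tokenizer (it is not from the paper; the regex and the function name are illustrative). It lowercases the text and treats everything other than letters and digits, including punctuation, brackets, and hyphens, as a delimiter; a production tokenizer would add rules for abbreviations, numbers, and dates.

```python
import re

def tokenize(text):
    # Keep runs of letters/digits; punctuation, brackets and hyphens
    # act as delimiters. Lowercasing gives consistency across documents.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Tokenization is useful in IR-systems (and NLP)."))
# ['tokenization', 'is', 'useful', 'in', 'ir', 'systems', 'and', 'nlp']
```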
Challenges in Tokenization

Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of their words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as their words do not have clear boundaries. Tokenizing sentences in unsegmented languages requires additional lexical and morphological information. Tokenization is also affected by the writing system and the typographical structure of the words. The structures of languages can be grouped into three categories:

Isolating: words do not divide into smaller units. Example: Mandarin Chinese.

Agglutinative: words divide into smaller units. Example: Japanese, Tamil.

Inflectional: boundaries between morphemes are unclear and ambiguous in terms of grammatical meaning. Example: Latin.
III. Stop Word Removal

Many words in documents recur very frequently but are essentially meaningless, serving only to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents, and due to their high frequency of occurrence, their presence is an obstacle to understanding the content of the documents in text mining. Stop words are very frequently used common words like 'and', 'are', and 'this'. They are not useful in the classification of documents, so they must be removed. However, developing such a stop word list is difficult, and lists are inconsistent between textual sources. Removing stop words also reduces the text data and improves system performance. Every text document contains these words, which are not necessary for text mining applications.
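A minimal sketch of stop word removal follows (illustrative; the tiny stop word list is only a sample, since, as noted above, real lists differ between sources):

```python
STOP_WORDS = {"and", "are", "this", "the", "is", "of", "a"}  # sample list only

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop word list.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["this", "process", "reduces", "the", "text", "data"]))
# ['process', 'reduces', 'text', 'data']
```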
IV. Stemming

Stemming is the process of conflating the variant forms of a word into a common representation, the stem. For example, the words "presentation", "presented", and "presenting" can all be reduced to the common representation "present". This is a widely used procedure in text processing for information retrieval (IR), based on the assumption that posing a query with the term presenting implies an interest in documents containing the words presentation and presented as well.
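The paper does not prescribe a particular stemming algorithm at this point; as one widely used example, NLTK's implementation of the Porter stemmer conflates the three words from the example above (this assumes the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["presentation", "presented", "presenting"]:
    print(word, "->", stemmer.stem(word))
# presentation -> present
# presented -> present
# presenting -> present
```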
Errors in Stemming

There are two main kinds of errors in stemming:
1. Over-stemming
2. Under-stemming

Over-stemming occurs when two words with different stems are stemmed to the same root; this is also known as a false positive (for example, conflating "universal" and "university"). Under-stemming occurs when two words that should be stemmed to the same root are not; this is also known as a false negative (for example, failing to conflate "alumnus" and "alumni").
Types of Stemming Algorithms

i) Table Lookup Approach

One way to perform stemming is to store a table of all index terms and their stems. Terms from queries and indexes can then be stemmed via table lookup, using B-trees or hash tables. Such lookups are very fast, but the approach has problems. First, no such exhaustive data exists for English; even if it did, many terms would not be represented because they are domain specific, so other stemming methods would still be needed. The second issue is the storage overhead of the table.
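In Python, a hash table lookup of this kind is simply a dictionary, as in the sketch below (the table entries are hypothetical; a real table would have to cover the entire index vocabulary):

```python
# Hypothetical stem table; in practice it must cover the index vocabulary.
STEM_TABLE = {
    "presentation": "present",
    "presented": "present",
    "presenting": "present",
}

def lookup_stem(term):
    # Return the stored stem, or the term unchanged if it is not in the table.
    return STEM_TABLE.get(term, term)

print(lookup_stem("presenting"))  # present
print(lookup_stem("reading"))     # reading (not covered -> left as-is)
```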
ii) Successor Variety

Successor variety stemmers are based on work in structural linguistics that determines word and morpheme boundaries from the distribution of phonemes. The successor variety of a string is the number of distinct characters that follow it in the words of some body of text. For example, consider a body of text consisting of the following words:

able, ape, beatable, finable, read, readable, reading, reads, red, rope, ripe.

Let us determine the successor variety for the word "read". The first letter of "read" is R. R is followed in the text body by three distinct characters, E, I, and O, so the successor variety of R is 3. The next successor variety for "read" is 2, since A and D follow RE in the text body, and so on. The following table shows the complete successor variety for the word "read".
Prefix   Successor Variety   Letters
R        3                   E, I, O
RE       2                   A, D
REA      1                   D
READ     3                   A, I, S

Table 1.1: Successor variety for the word "read"
Once the successor variety for a given word has been determined, this information is used to segment the word. Hafer and Weiss discuss several ways of doing this (a code sketch of the cut-off method follows the list below):

1. Cut-off method: some cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.
2. Peak and plateau method: a segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
3. Complete word method: a break is made after a segment if the segment is a complete word in the corpus.
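The sketch below (illustrative; the cutoff value of 3 is an arbitrary choice) computes successor varieties over the example corpus above and applies the cut-off method:

```python
CORPUS = ["able", "ape", "beatable", "finable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

def successor_variety(prefix, corpus):
    # Number of distinct characters that follow `prefix` in the corpus.
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

def cutoff_breaks(word, corpus, cutoff=3):
    # Cut-off method: mark a boundary after every prefix whose
    # successor variety reaches the cutoff value.
    return [word[:i] for i in range(1, len(word) + 1)
            if successor_variety(word[:i], corpus) >= cutoff]

for i in range(1, 5):
    prefix = "read"[:i]
    print(prefix, successor_variety(prefix, CORPUS))
# r 3, re 2, rea 1, read 3  (matching Table 1.1)
print(cutoff_breaks("readable", CORPUS))  # ['r', 'read']
```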
iii) N-Gram Stemmers

This method was designed by Adamson and Boreham and is also called the shared digram method. A digram is a pair of consecutive letters. The method is called an n-gram method because trigrams or higher-order n-grams could be used instead. In this method, association measures are calculated between pairs of terms based on their shared unique digrams.

For example, consider the two words "stemming" and "stemmer":

stemming -> st te em mm mi in ng
stemmer  -> st te em mm me er

Here "stemming" has 7 unique digrams and "stemmer" has 6 unique digrams, and the two words share 4 unique digrams: st, te, em, mm. Once the numbers of unique digrams are found, a similarity measure based on them is calculated using the Dice coefficient, defined as

S = 2C / (A + B)

where C is the number of unique digrams common to both words, A is the number of unique digrams in the first word, and B is the number of unique digrams in the second word. Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
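A short sketch of the digram and Dice coefficient computation (the function names are illustrative):

```python
def unique_digrams(word):
    # Set of distinct adjacent letter pairs in the word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    # S = 2C / (A + B) over the two words' unique digram sets.
    da, db = unique_digrams(a), unique_digrams(b)
    return 2 * len(da & db) / (len(da) + len(db))

print(sorted(unique_digrams("stemming")))     # 7 unique digrams
print(sorted(unique_digrams("stemmer")))      # 6 unique digrams
print(round(dice("stemming", "stemmer"), 3))  # 2*4/(7+6) = 0.615
```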
iv) Affix Removal Stemmers

Affix removal stemmers remove suffixes or prefixes from terms, leaving the stem. One example of an affix removal stemmer is one that removes the plural forms of terms. A set of rules for such a stemmer, due to Harman, is as follows:

a) If a word ends in "ies", but not "eies" or "aies":
   "ies" -> "y"
b) If a word ends in "es", but not "aes", "ees", or "oes":
   "es" -> "e"
c) If a word ends in "s", but not "us" or "ss":
   "s" -> NULL
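These three rules translate directly into Python, as in the sketch below (the example words are illustrative, not from the paper):

```python
def remove_plural(word):
    # Rule (a): "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"            # queries -> query
    # Rule (b): "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                  # bridges -> bridge
    # Rule (c): "s" -> NULL, unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                  # reads -> read
    return word

for w in ["queries", "bridges", "reads", "glass", "corpus"]:
    print(w, "->", remove_plural(w))
```

Note that the rules must be tested in this order, from the longest suffix to the shortest, so that a word like "queries" is handled by rule (a) rather than rule (c).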
V. Conclusion

In this work we have presented efficient preprocessing techniques. These preprocessing techniques eliminate noise from text data, identify the root forms of words, and reduce the size of the text data. This improves the performance of the IR system.
References

1. Vishal Gupta and Gurpreet S. Lehal, "A Survey of Text Mining Techniques and Applications", Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1, August 2009.
2. O. Durmaz and H. S. Bilge, "Effect of Dimensionality Reduction and Feature Selection in Text Classification", IEEE conference, 2011, pp. 21-24.
3. G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Inc.
4. Chris D. Paice, "An Evaluation Method for Stemming Algorithms", Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 42-50.
5. J. Cowie and Y. Wilks, Information Extraction, New York, 2000.
6. Anjali Ganesh Jivani, "A Comparative Study of Stemming Algorithms", Int. J. Comp. Tech. Appl., Vol. 2, No. 6, pp. 1930-1938.
Conference Paper
The effectiveness of stemming algorithms has usually been measured in terms of their effect on retrieval performance with test collections. This however does not provide any insights which might help in stemmer optimisation. This paper describes a method in which stemming performance is assessed against predefine concept groups in samples of words. This enables various indices of stemming performance and weight to be computed. Results are reported for three stemming algorithms. The validity and usefulness of the approach, and the problems of conceptual grouping, are discussed, and directions for further research are identified.