Conference Paper

Optimal Stop Word Selection for Text Mining in Critical Infrastructure Domain

Authors: Amarasinghe, Manic, and Hruska

Abstract

Eliminating all stop words from the feature space is a standard preprocessing practice in text mining, regardless of the domain to which it is applied. However, this may result in the loss of important information, which adversely affects the accuracy of the text mining algorithm. Therefore, this paper proposes a novel methodology for selecting the optimal set of domain-specific stop words for improved text mining accuracy. First, the presented methodology retains all the stop words in the text preprocessing phase. Then, an evolutionary technique is used to extract the optimal set of stop words that results in the best classification accuracy. The presented methodology was implemented on a corpus of open source news articles related to critical infrastructure hazards. The first step of mining geo-dependencies among critical infrastructures from text is text classification. To achieve this, article content was classified into two classes: 1) text content with geo-location information, and 2) text content without geo-location information. The classification accuracy of the presented methodology was compared to the accuracies of four other test cases. Experimental results with 10-fold cross validation showed that the presented method yielded an increase of 1.76% or higher in True Positive (TP) rate and a 2.27% or higher increase in True Negative (TN) rate compared to the other techniques.
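The abstract gives no implementation details, but the general shape of such an evolutionary stop-word selection can be sketched as follows. This is a minimal illustration, assuming scikit-learn for the classifier and cross-validation; the candidate list, GA parameters, and fitness definition are illustrative assumptions, not the authors' code.

```python
# Sketch: evolutionary selection of a domain-specific stop-word set.
# One bit per candidate word (1 = remove it before vectorization);
# fitness is mean 10-fold CV accuracy, mirroring the paper's setup.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

CANDIDATES = ["the", "a", "an", "in", "on", "at", "of", "near", "by"]  # illustrative

def fitness(bits, docs, labels):
    """Mean CV accuracy when the selected stop words are removed."""
    stops = [w for w, b in zip(CANDIDATES, bits) if b]
    X = TfidfVectorizer(stop_words=stops or None).fit_transform(docs)
    return cross_val_score(MultinomialNB(), X, labels, cv=10).mean()

def evolve(docs, labels, pop_size=20, generations=30, p_mut=0.05):
    pop = [[random.randint(0, 1) for _ in CANDIDATES] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda b: fitness(b, docs, labels), reverse=True)
        nxt = scored[:2]                                   # elitism
        while len(nxt) < pop_size:
            p1, p2 = random.sample(scored[:10], 2)         # truncation selection
            cut = random.randrange(1, len(CANDIDATES))
            child = p1[:cut] + p2[cut:]                    # one-point crossover
            child = [b ^ (random.random() < p_mut) for b in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    best = max(pop, key=lambda b: fitness(b, docs, labels))
    return [w for w, b in zip(CANDIDATES, best) if b]      # stop words to remove
```

Keeping geo-relevant prepositions such as "near" in the feature space is exactly the kind of decision this search can make on its own, rather than removing a fixed list up front.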


... The first was based on word occurrence, and the second on statistics. Amarasinghe and Manic [16] proposed a tactic for identifying a domain-specific optimal set of stop words for the English language. Raulji et al. [17] followed a corpus-based method to eliminate stop words from the Sanskrit language. ...
... They are the triggering symbols for this FA. When the machine sees E, it moves into state r16. When the machine sees F, it moves into state r17. ...
... If state r17 is an accepting state, then the machine returns true. If, after E, there is a different triggering symbol that leaves the machine stuck in the same state r16, then the machine returns false. Returning true means that the triggering symbols along the path from the start node to the accepting node form the required stop phrase. ...
Article
Full-text available
Though plenty of research has been done on stop word/phrase detection, no work has been done on Bengali stop words and stop phrases. This research introduces a definition and classification of Bengali stop words and phrases and implements two approaches to identify them. The first is a corpus-based approach, while the second is based on a finite-state automaton. The performance of both approaches is measured and compared. Result analysis shows that the corpus-based method outperforms the finite-state automaton-based method: the corpus-based and finite-state automaton-based methods show 90% and 80% accuracy, respectively, for stop word detection, and 80% and 70% accuracy, respectively, for stop phrase detection.
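As a complement to the excerpts above, the finite-state formulation can be sketched as a trie over stop phrases with accepting states; the phrase list below is illustrative, and this is a reconstruction of the general technique, not the authors' implementation.

```python
# Sketch: stop-phrase detection with a token-level finite-state
# automaton. States form a trie over known stop phrases; reaching an
# accepting state means the tokens consumed so far form a stop phrase.
STOP_PHRASES = [("as", "well", "as"), ("in", "order", "to")]  # illustrative

def build_fsm(phrases):
    trie = {}
    for phrase in phrases:
        node = trie
        for tok in phrase:
            node = node.setdefault(tok, {})
        node["<accept>"] = True                       # mark accepting state
    return trie

def strip_stop_phrases(tokens, fsm):
    out, i = [], 0
    while i < len(tokens):
        node, j, matched = fsm, i, 0
        while j < len(tokens) and tokens[j] in node:  # follow transitions
            node = node[tokens[j]]
            j += 1
            if node.get("<accept>"):
                matched = j - i                       # longest accepted span
        if matched:
            i += matched                              # skip the stop phrase
        else:
            out.append(tokens[i])
            i += 1
    return out

fsm = build_fsm(STOP_PHRASES)
print(strip_stop_phrases("we met in order to plan".split(), fsm))
# ['we', 'met', 'plan']
```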
... Many articles on the bag-of-words method [1,4,7,18,23] show that an integral part of the algorithm is the processing of stop-words. In Amarasinghe, Manic and Hruska [23] this stage was given special attention. They emphasized that the removal of these words leads to the loss of some useful information. ...
... Therefore, it was proposed that an alternative method be used instead, in which the stop-words are considered separately from the keywords and the dimensionality is reduced using a genetic algorithm. In [23], experiments were carried out that showed that the accuracy of the algorithm increased by two percent. However, those experiments were conducted on a fairly small amount of data, and there are open questions as to whether the proposed method is effective and, most importantly, whether it can quickly reduce the dimensionality of stop-words on a large amount of data. ...
... Text mining is the process of generating useful information from text data (Amarasinghe et al., 2015). There are multiple methods and techniques for text mining in different languages. ...
... Stop words have multiple occurrences in many sentences and have no semantic importance in the context in which they appear. The concept of stop words was first proposed in IR (Information Retrieval) systems (Amarasinghe et al., 2015). There has been a lot of work on the Urdu language in the domain of information retrieval, but the removal of stop words has received little attention. ...
Article
Full-text available
Stop words have multiple occurrences in many sentences and have the least semantic importance in the context in which they appear. Stop words cover a major volume of documents while carrying very little semantic importance, so they should be removed for better language processing and classification. In this research study, we have designed and proposed an efficient algorithm for the elimination of stop words from Urdu documents. There has been a lot of work in domains like natural language processing (NLP), sentence boundary disambiguation, and stemming for the Urdu language, but we are unaware of any work or methodology proposed for the elimination of stop words from the Urdu language. That is why we applied the algorithm proposed by Al-Shalabi et al. (2004) to the Urdu language. To the best of our knowledge, we are the first to apply any kind of stop word elimination algorithm to the Urdu language.
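The Al-Shalabi et al. (2004) algorithm itself is not reproduced in this listing; the sketch below shows a generic corpus-based heuristic of the same family, flagging words whose document frequency exceeds a cutoff as stop-word candidates. The threshold is an illustrative assumption.

```python
# Sketch: corpus-based stop-word extraction by document frequency.
from collections import Counter

def extract_stop_words(tokenized_docs, df_threshold=0.8):
    """Return words appearing in more than df_threshold of documents."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                 # count each word once per document
    n = len(tokenized_docs)
    return {w for w, c in df.items() if c / n > df_threshold}

docs = [["the", "power", "grid"], ["the", "flood", "hit"], ["the", "storm"]]
print(extract_stop_words(docs))             # {'the'}
```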
... The process of extracting useful information from text document data and processing that information is known as text mining [13]. There are various techniques for text mining in many other languages. ...
... Stop words occur in many sentences, and their semantic significance matters little in the context in which they occur. The information retrieval framework introduced the idea of stop words [13]. A great deal of work has been done for the Urdu language, but the elimination of stop words from Urdu remains largely unaddressed. ...
... The raw data obtained from the data scraping process first needs to be processed with the text preprocessing methods commonly used in natural language processing. The preprocessing methods used in this study include converting all characters to lowercase, eliminating single characters, symbols, and numbers, and filtering stopwords to reduce features and improve the performance of the topic modeling algorithm [12,13]. ...
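A minimal sketch of the preprocessing chain listed in that excerpt (lowercasing; removing single characters, symbols, and numbers; stop-word filtering); the regular expression and stop-word list are illustrative choices, not the study's exact configuration.

```python
# Sketch: common text preprocessing before topic modeling.
import re

STOPWORDS = {"the", "is", "at", "of", "and", "a"}     # illustrative list

def preprocess(text):
    text = text.lower()                               # lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)             # strip symbols and numbers
    tokens = [t for t in text.split() if len(t) > 1]  # drop single characters
    return [t for t in tokens if t not in STOPWORDS]  # filter stop words

print(preprocess("Topic Modeling improves 99% of the time!"))
# ['topic', 'modeling', 'improves', 'time']
```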
... Manalu [9] conducted a comparative study of automatic review summarization with and without stop-words using TextRank. Amarasinghe et al. [10] proposed a methodology for selecting optimal domain-specific stop-words for improved accuracy in text mining. Popova and Skitalinskaya [11] proposed a key-phrase-extraction-based methodology for short texts to create an extended list of stop-words. ...
Chapter
For the retrieval of information from different sources and formats, pre-processing of the collected information is the most important task. The process of stop-word elimination is one such part of the pre-processing phase. This paper presents, for the first time, a list of stop-words, stop-stems, and stop-lemmas for the Malayalam language of India. Initially, a corpus of the Malayalam language was created. The total count of words in the corpus was more than 21 million, out of which approximately 0.33 million were unique words. This was processed to yield a total of 153 stop-words. Stemming was possible for 20 words, and lemmatization could be done for 25 words only. The final refined stop-word list consists of 123 stop-words. Malayalam is widely spoken by people living in India and many other parts of the world, and the results presented here can be used by any NLP activity for this language. Keywords: Malayalam, Stop-words, Stemming, Lemmatization, Natural language processing (NLP)
... Stop words, that is, common words that have no meaning or are less meaningful than other keywords, were removed. Removing stop words can sharpen the focus on essential words [127], reduce feature size, and improve accuracy [130,131]. The types of words considered in this study were limited to nouns, verbs, adverbs, and adjectives through part of speech (POS) filtering. ...
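One possible realization of that POS filtering step is shown below, using NLTK's default tagger to keep only nouns, verbs, adverbs, and adjectives; the tagger choice is an assumption, since the study's actual tool is not named in the excerpt.

```python
# Sketch: keep only nouns, verbs, adverbs, and adjectives via POS filtering.
import nltk
from nltk import pos_tag

nltk.download("averaged_perceptron_tagger", quiet=True)  # resource name may vary by NLTK version

KEEP = ("NN", "VB", "RB", "JJ")   # Penn Treebank prefixes for the four classes

def pos_filter(text):
    tagged = pos_tag(text.lower().split())
    return [w for w, tag in tagged if tag.startswith(KEEP)]

print(pos_filter("smart cities rapidly deploy sustainable sensors"))
# e.g. ['smart', 'cities', 'rapidly', 'deploy', 'sustainable', 'sensors']
```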
Article
Full-text available
The literature discussing the concepts, technologies, and ICT-based urban innovation approaches of smart cities has been growing, along with initiatives from cities all over the world that are competing to improve their services and become smart and sustainable. However, current studies that provide a comprehensive understanding and reveal smart and sustainable city research trends and characteristics are still lacking. Meanwhile, policymakers and practitioners alike need to pursue progressive development. In response to this shortcoming, this research offers content analysis studies based on topic modeling approaches to capture the evolution and characteristics of topics in the scientific literature on smart and sustainable city research. More importantly, a novel topic-detecting algorithm based on the deep learning and clustering techniques, namely deep autoencoders-based fuzzy C-means (DFCM), is introduced for analyzing the research topic trend. The topics generated by this proposed algorithm have relatively higher coherence values than those generated by previously used topic detection methods, namely non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), and eigenspace-based fuzzy C-means (EFCM). The 30 main topics that appeared in topic modeling with the DFCM algorithm were classified into six groups (technology, energy, environment, transportation, e-governance, and human capital and welfare) that characterize the six dimensions of smart, sustainable city research.
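The proposed DFCM algorithm (a deep autoencoder feeding fuzzy C-means) is not reproduced here, but the LDA and NMF baselines it is compared against can be sketched with scikit-learn; the corpus and topic count below are illustrative.

```python
# Sketch: LDA and NMF topic-model baselines on a toy corpus.
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["smart grid energy storage", "urban transport e-governance",
        "sustainable city sensors", "energy efficient transport policy"]

tf = CountVectorizer().fit(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(tf.transform(docs))                       # LDA works on raw counts

tfidf = TfidfVectorizer().fit(docs)
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.transform(docs))                    # NMF works well on TF-IDF

terms = tf.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]              # three strongest terms
    print(f"LDA topic {k}:", [terms[i] for i in top])
```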
... - Elimination of stop words: Stop words are a set of words that provide little or no semantic meaning in texts; they are generally the words that appear most frequently in a language and include prepositions, pronouns, auxiliary verbs, etc. Eliminating stop words is a basic pre-processing step in text mining, which, as the name suggests, consists of removing the stop words from the feature set of the texts [1]. The catalog used contains 613 stop words in Spanish. ...
... The postfix and prefix words are given in the past, present, and future tenses [13]. Fuzzy multiple-criteria decision making is combined with synthetic weights defined by human preference [14]. Fuzzy c-means is compared by researchers against the other algorithms. ...
... statistical, word distribution in documents using a variance measure, and using an entropy measure. An evolutionary technique was proposed by [9] to extract the optimal set of stop words in the critical infrastructure domain. ...
Article
Full-text available
Most of the research in the field of information retrieval (IR) has focused on the English language, but recently there has been a considerable amount of work and effort to develop IR systems for languages other than English. Research and experimentation in the field of IR for the Hindi language are relatively new and limited compared to the research that has been done in English, which has been dominant in the field of IR for a long while. A fundamental tool in IR is the employment of stop word lists. Stop words have no retrieval value in IR. Till now, many stop word lists have been developed for English, European, and Chinese languages. However, no standard stop word list has been constructed for the Hindi language. In this paper, an approach to constructing a generic stop word list for the Hindi language is presented. Our list contains more than 800 stop words.
... Stop words usually refer to the most common words in a language. There is no single uniform list of stop words used by all natural language processing tools [16]. Usually, a stop word is rarely meaningful or useful even though it may be essential for a sentence to be grammatically correct, such as "the", "this", "a", and "on". ...
Article
Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
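As an example of one customization the article discusses, retaining multiword expressions during tokenization, here is a sketch using NLTK's MWETokenizer; the expression list is an illustrative assumption.

```python
# Sketch: keep multiword expressions as single tokens.
from nltk.tokenize import MWETokenizer

mwe = MWETokenizer([("part", "of", "speech"), ("machine", "translation")],
                   separator="_")
tokens = mwe.tokenize("part of speech tagging helps machine translation".split())
print(tokens)  # ['part_of_speech', 'tagging', 'helps', 'machine_translation']
```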
Thesis
Full-text available
The objective of the dissertation research is to increase the efficiency of the discovery and selection of scientific publications when forming scientific-support or recommender bibliographic indexes. To this end, a minimal terminologically saturated ordered publication set is collected with the designed information technology, which is based on the developed mathematical model and the method of discovery and selection implemented as a software package. For the first time, a hybrid mathematical model of the process of bibliographic discovery and selection is developed. The distinctive feature of the model is its combination of iterative snowballing, a probabilistic topic model of the text documents, citation network analysis, automatic term extraction, a stationary-point condition for the iterations, and a condition of terminological saturation in the domain of interest. Ukrainian full text available.
Article
The paper proposes an original methodology of authorship attribution based on deviations from the Zipf distribution and on statistical data obtained with the help of a concordance program and computations performed in a table processor. The methodology involves finding distances between input texts and a reference text based on deviations of stop-word frequencies. The results that have been achieved prove that the proposed methodology allows performing efficient authorship attribution and that it can be used in the educational process to develop student skills and competencies pertaining to natural language processing.
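A minimal sketch of the kind of distance computation the methodology describes, comparing stop-word frequency profiles of an input text and a reference text; the Euclidean metric and stop-word list below are assumptions, and the paper's exact deviation measure relative to the Zipf distribution is not reproduced.

```python
# Sketch: authorship attribution by distance between stop-word profiles.
import math
from collections import Counter

STOPWORDS = ["the", "of", "and", "to", "in", "that", "it", "is"]  # illustrative

def stopword_profile(tokens):
    """Relative frequency of each stop word within the stop-word mass."""
    counts = Counter(t for t in tokens if t in STOPWORDS)
    total = max(sum(counts.values()), 1)
    return [counts[w] / total for w in STOPWORDS]

def profile_distance(text_a, text_b):
    pa = stopword_profile(text_a.lower().split())
    pb = stopword_profile(text_b.lower().split())
    return math.dist(pa, pb)   # smaller distance suggests the same author
```

The input text is then attributed to whichever reference text (author) minimizes this distance.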
Conference Paper
Social media are modern web-based applications for communication and interaction between humans through audio, written, and video messages. These platforms build and sustain living communities around the world, and people share their interests and activities through them. Twitter is a social media site where people communicate through tweets, a service that enables users to keep in touch through the exchange of frequent, quick messages. People publish tweets on their profiles and send them to their followers to express their thoughts and opinions about events in the world, so it is crucial to study and categorize these tweets. This work uses fuzzy logic and a genetic algorithm to solve the problem of text classification based on relevance degree. The inputs to this classification system are a set of features extracted from a tweet, and the output is a classification decision for the tweet: its degree of relevance to an appointed event, i.e., whether the tweet is relevant or irrelevant to the desired event. The results are compared with a keyword-search and fuzzy-logic method on the basis of incremental rate and correction rate. On incremental rate, the proposed system extracts more tweets than that method: in dataset 1, the proposed system extracted 160 tweets, while the other approaches extracted 98 and 141. The correction rate of the proposed system is 98.75, against 97.9 and 95.7 for the compared method.
Article
Full-text available
Classification of economic journal articles has been done using the VSM (Vector Space Model) approach and the Cosine Similarity method. The results of previous studies are considered less than optimal because Stopword Removal was carried out using a dictionary of basic words (tuning); therefore, the omitted words were limited to basic words only. This study shows the improved accuracy of the Cosine Similarity method when using frequency-based Stopword Removal. The reason is that a term with a certain frequency is assumed to be an insignificant word and will give less relevant results. Performance testing of the Cosine Similarity method with frequency-based Stopword Removal added was done using K-fold Cross Validation. The method produced an accuracy of 64.28%, a precision of 64.76%, and a recall of 65.26%. The execution time after pre-processing was 0.05033 seconds.
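The study's exact pipeline is not given here, but VSM classification by cosine similarity with frequency-based stop-word removal can be sketched as below, using TfidfVectorizer's max_df cutoff to drop overly frequent terms; the data, cutoff, and centroid decision rule are illustrative assumptions.

```python
# Sketch: cosine-similarity classification with frequency-based
# stop-word removal (terms above the max_df cutoff are discarded).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train = ["inflation monetary policy", "stock market returns",
         "fiscal policy inflation", "equity market volatility"]
labels = np.array([0, 1, 0, 1])                 # two journal categories

vec = TfidfVectorizer(max_df=0.9)               # frequency-based stop-word removal
X = vec.fit_transform(train).toarray()
centroids = np.vstack([X[labels == c].mean(axis=0) for c in (0, 1)])

def classify(text):
    v = vec.transform([text]).toarray()
    return int(cosine_similarity(v, centroids).argmax())

print(classify("market volatility and stock returns"))  # expected: 1
```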
Conference Paper
Full-text available
As critical and sensitive systems increasingly rely on complex software, identifying software vulnerabilities is becoming ever more important. It has been suggested in previous work that some bugs are only identified as vulnerabilities long after the bug has been made public. These bugs are known as Hidden Impact Bugs (HIBs). This paper presents a hidden impact bug identification methodology based on text mining bug databases. The presented methodology utilizes the textual description of the bug report for extracting textual information. The text mining process extracts syntactical information from the bug reports and compresses it for easier manipulation. The compressed information is then utilized to generate a feature vector that is presented to a classifier. The proposed methodology was tested on Linux vulnerabilities discovered in the period from 2006 to 2011. Three different classifiers were tested, and 28% to 88% of the hidden impact bugs were identified correctly using the textual information from the bug descriptions alone. Further analysis of the Bayesian detection rate showed the applicability of the presented method according to the requirements of a development team.
Article
Full-text available
Many different demands can be made of intrusion detection systems. An important requirement is that an intrusion detection system be effective; that is, it should detect a substantial percentage of intrusions into the supervised system, while still keeping the false alarm rate at an acceptable level. This article demonstrates that, for a reasonable set of assumptions, the false alarm rate is the limiting factor for the performance of an intrusion detection system. This is due to the base-rate fallacy phenomenon: in order to achieve substantial values of the Bayesian detection rate P(Intrusion|Alarm), we have to achieve a (perhaps in some cases unattainably) low false alarm rate. A selection of reports of intrusion detection performance are reviewed, and the conclusion is reached that there are indications that at least some types of intrusion detection have far to go before they can attain such low false alarm rates.
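The base-rate argument is easy to make concrete with a short Bayes computation; the rates below are illustrative numbers, not figures from the article.

```python
# Worked example of the base-rate fallacy: even a good detector yields
# a low Bayesian detection rate P(Intrusion | Alarm) when intrusions
# are rare, because false alarms dominate.
def bayesian_detection_rate(p_intrusion, tp_rate, fp_rate):
    p_alarm = tp_rate * p_intrusion + fp_rate * (1 - p_intrusion)
    return tp_rate * p_intrusion / p_alarm

# 1 in 100,000 events is an intrusion; 99% TP rate; 0.1% FP rate.
print(bayesian_detection_rate(1e-5, 0.99, 1e-3))  # ~0.0098: under 1% of alarms are real
```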
Article
Full-text available
The attribute-value representation of documents used in Text Mining provides a natural framework for classifying or clustering documents based on their content. Supervised learning algorithms can be applied whenever the documents have labels preassigned, and unsupervised learning for unlabeled documents. The attribute-value representation of documents is characterized by very high dimensional data, since every word in the document may be treated as an attribute. However, the representation of documents has a crucial influence on how well a supervised learning algorithm can generalize. This work presents a way to efficiently decompose text into words (stems) using the bag-of-words approach, as well as to reduce the dimensionality of its representation, making text accessible to most Machine Learning algorithms that require each example to be described by a vector of fixed dimensionality. A computational tool we have implemented is used on a real case in order to illustrate our proposal, as well as several of the facilities implemented in the tool which allow improving the accuracy of classifiers through the reduction of the dimensionality of the text representation.
Conference Paper
This paper proposes an estimation method for the confidence level of feedback information (CLFI), namely the confidence level of reported information in a computer integrated manufacturing (CIM) architecture for logic diagnosis. We studied the factors affecting CLFI, such as measurement system reliability, production context, position of sensors in the acquisition chains, type of products, reference metrology, and preventive and corrective maintenance, based on historical data and feedback information generated by production equipment. We introduce the new CLFI concept based on the Dynamic Bayesian Network (DBN) approach, the Naïve Bayes model, and the Tree Augmented Naïve Bayes model. Our contribution includes an online confidence computation module for production equipment data and an algorithm to compute CLFI.
Article
In this paper we propose an automated method for generating domain-specific stop words to improve classification of natural language content. We also implemented a Bayesian natural language classifier working on web pages, based on maximum a posteriori probability estimation of keyword distributions using the bag-of-words model, to test the generated stop words. We investigated the distribution of stop-word lists generated by our model and compared their contents against a generic stop-word list for the English language. We also show that the document coverage rank and topic coverage rank of words belonging to natural language corpora follow Zipf's law, just as the word frequency rank is known to do.
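The Zipf-law observation can be checked empirically by plotting log-rank against log-frequency; the helper below computes those points for a token stream, where a near-linear fit with slope around -1 indicates Zipf-like behavior. This is an illustrative check, not the paper's analysis.

```python
# Sketch: rank-frequency points for a Zipf's-law check.
import math
from collections import Counter

def zipf_points(tokens):
    """(log rank, log frequency) pairs, rank 1 = most frequent word."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return [(math.log(rank), math.log(f)) for rank, f in enumerate(freqs, start=1)]
```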
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
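The algorithm survives in many modern implementations; a brief usage sketch with NLTK's PorterStemmer (the original program was written in BCPL):

```python
# Usage sketch of the Porter suffix-stripping algorithm via NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["relational", "conditionally", "caresses"]])
# ['relat', 'condition', 'caress']
```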
Conference Paper
Summary form only given. We have designed and implemented an efficient stop-word removal algorithm for the Arabic language based on a finite state machine (FSM). An efficient stop-word removal technique is needed in many natural language processing applications, such as spelling normalization, stemming and stem weighting, question answering systems, and information retrieval (IR) systems. Most existing stop-word removal techniques rely on a dictionary that contains a list of stop words; this is very expensive, as the searching process takes too much time and too much space is required to store the stop words. The new Arabic stop-word removal technique has been tested using a set of 242 Arabic abstracts chosen from the Proceedings of the Saudi Arabian National Computer Conferences and another set of data chosen from the Holy Qur'an, and it gives impressive results, reaching approximately 98%.
Conference Paper
In the text preprocessing stage of text mining, a stop-word list is constructed to filter the segmentation results of text documents so that the dimensionality of the text feature space can be substantially reduced. This paper summarizes the definition, extraction principles, and methods for stop-words, and constructs a customized Chinese-English stop-word list from classical stop-word lists, accounting for differences across text document domains. Three different filter algorithms were designed and implemented for the stop-word filtering process, and their efficiency was compared. The experiment indicated that the hash-filter method was the fastest.
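The hash-filter finding is easy to reproduce in miniature by filtering tokens against a stop-word list stored as a hash set versus a plain list; the data sizes below are illustrative, and the paper's three filter algorithms are not otherwise specified in this listing.

```python
# Sketch: hash (set) lookup vs. linear list lookup for stop-word filtering.
import timeit

stop_list = [f"word{i}" for i in range(1000)]
stop_set = set(stop_list)                       # hash-based lookup, O(1) per token
tokens = ["word500", "keep", "word999", "another"] * 10_000

list_time = timeit.timeit(lambda: [t for t in tokens if t not in stop_list], number=1)
set_time = timeit.timeit(lambda: [t for t in tokens if t not in stop_set], number=1)
print(f"list filter: {list_time:.3f}s, set (hash) filter: {set_time:.3f}s")
```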
Article
Written communication of ideas is carried out on the basis of statistical probability in that a writer chooses that level of subject specificity and that combination of words which he feels will convey the most meaning. Since this process varies among individuals and since similar ideas are therefore relayed at different levels of specificity and by means of different words, the problem of literature searching by machines still presents major difficulties. A statistical approach to this problem will be outlined and the various steps of a system based on this approach will be described. Steps include the statistical analysis of a collection of documents in a field of interest, the establishment of a set of “notions” and the vocabulary by which they are expressed, the compilation of a thesaurus-type dictionary and index, the automatic encoding of documents by machine with the aid of such a dictionary, the encoding of topological notations (such as branched structures), the recording of the coded information, the establishment of a searching pattern for finding pertinent information, and the programming of appropriate machines to carry out a search.
Conference Paper
Text classification is an active research area in information retrieval and natural language processing. A fundamental tool in text classification is a list of 'stop' words (a stop word list) that is used to identify frequent words that are unlikely to assist in classification and hence are deleted during pre-processing. Till now, many stop word lists have been developed for the English language. However, no standard stop word list has been constructed for Chinese text classification yet. In this paper, we give a refined definition of stop words in Chinese text classification from the perspective of statistical correlation, then propose an automatic approach to extracting the stop word list based on the weighted Chi-squared statistic on a 2×p contingency table. We evaluate the stop word lists using accuracies obtained from text classification experiments on a real-world Chinese corpus. The results show that the proposed approach is effective. The stop word lists derived by the approach can speed up the calculation and increase the accuracy of classification at the same time.
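The weighted statistic itself is not reproduced here, but the underlying idea, that words nearly independent of the class labels are stop-word candidates, can be sketched with a standard chi-squared score; low-scoring terms are the candidates. The corpus and labels below are illustrative.

```python
# Sketch: rank candidate stop words by chi-squared dependence on class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["the stock market fell today", "the game ended in a draw",
        "the bank raised rates today", "the team won the match"]
labels = [0, 1, 0, 1]                            # finance vs. sports

vec = CountVectorizer()
X = vec.fit_transform(docs)
scores, _ = chi2(X, labels)
ranked = sorted(zip(scores, vec.get_feature_names_out()))
print(ranked[:3])   # lowest scores: words least tied to any class, e.g. 'the'
```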
Article
David Goldberg's Genetic Algorithms in Search, Optimization and Machine Learning is by far the bestselling introduction to genetic algorithms. Goldberg is one of the preeminent researchers in the field--he has published over 100 research articles on genetic algorithms and is a student of John Holland, the father of genetic algorithms--and his deep understanding of the material shines through. The book contains a complete listing of a simple genetic algorithm in Pascal, which C programmers can easily understand. The book covers all of the important topics in the field, including crossover, mutation, classifier systems, and fitness scaling, giving a novice with a computer science background enough information to implement a genetic algorithm and describe genetic algorithms to a friend.
Conference Paper
Given a data set and a learning task such as classification, there are two prime motives for executing some kind of data set reduction. On one hand, there is the possible improvement in algorithm performance. On the other hand, the decrease in the overall size of the data set can bring advantages in storage space used and time spent computing. Our purpose is to determine the importance of several basic reduction techniques for Support Vector Machines, by comparing their relative performance improvement when applied to the standard REUTERS-21578 benchmark.
Conference Paper
A recently proposed adaptive strategy for text recognition uses the linguistic fact that over half of the words on a typical English page are among 150 common stop words. The small lexicon permits word-shape based recognition that yields word identities from which character prototypes can be extracted. This paper describes a fast procedure for locating the best candidates for those stop words. The procedure uses width statistics of individual words and their immediate neighbors. In an experiment using 400 page images, the method removed 63% of the words from consideration. The stop/nonstop word discrimination also assists keyword spotting for information retrieval.
Article
A number of recent data mining techniques have been targeted especially at the analysis of sequential data. Traditional examples of sequential data involve telecommunication alarms, WWW log files, user action registration for HCI studies, or any other series of events consisting of an event type and a time of occurrence. Text can also be seen as sequential data, in many respects similar to the data collected by sensors or other observation systems. Traditionally, texts have been analysed using various information retrieval related methods, such as full-text analysis and natural language processing. However, only few examples of data mining in text, particularly in full text, are available. In this paper we show that general data mining methods are applicable to text analysis tasks under certain conditions. Moreover, we present a general framework for text mining. The framework follows the general KDD process, thus containing steps from preprocessing to the utilization of...
Article
We propose a new adaptive strategy for text recognition that attempts to derive knowledge about the dominant font on a given page. The strategy uses the linguistic observation that over half of all words in a typical English passage are contained in a small set of less than 150 stop words. A small dictionary of such words is compiled from the Brown corpus. An arbitrary text page first goes through layout analysis that produces word segmentation. A fast procedure is then applied to locate the most likely candidates for those words, using only the widths of the word images. The identity of each word is determined using a word shape classifier. Using the word images together with their identities, character prototypes can be extracted using a previously proposed method. We describe experiments using simulated and real images. In an experiment using 400 real page images, we show that, on average, 8 distinct characters can be learned from each page, and the method is successful on 90% of all the pages. These can serve as useful seeds to bootstrap font learning.
Article
Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption.
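The two event models usually contrasted in this line of work are the multinomial model (word counts) and the multivariate Bernoulli model (word presence); assuming that reading, here is a minimal comparison with scikit-learn on an illustrative corpus.

```python
# Sketch: multinomial vs. multivariate Bernoulli naive Bayes for text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["free prize money now", "meeting agenda attached",
        "win money free entry", "project status meeting"]
labels = [1, 0, 1, 0]                      # spam vs. ham

X = CountVectorizer().fit_transform(docs)  # BernoulliNB binarizes counts itself
for model in (MultinomialNB(), BernoulliNB()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```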
Reducing the Dimensionality of Bag-of-Words Text Representation Used by Learning Algorithms
C. A. Martins, M. C. Monard, and E. T. Matsubara, in Proc. of the 3rd IASTED International Conference on Artificial Intelligence and Applications, pp. 228-233, 2003.