A Brief Survey of Machine Learning Algorithms for Text Document Classification on Incremental Database

Abstract

The exponential growth of data in recent times is a critical and challenging issue that requires significant attention. More than two-thirds of the available information is stored in unstructured form, largely as text. Knowledge can be extracted from many different sources, and unstructured data remains the largest readily available source of knowledge, online or offline, so it deserves careful attention. Text mining is believed to have very high commercial potential. Text classification is the process of assigning text documents to predefined categories. Many databases are dynamic and are updated over time. Handling and classifying such datasets requires incremental learning, which trains the algorithm as new data arrive. This paper surveys machine learning algorithms for text classification on dynamic (incremental) databases, and also covers classifier architecture and text classification applications.
May - June 2020
ISSN: 0193-4120 Page No. 25246 - 25251
Published by: The Mattingley Publishing Co., Inc.
Nihar M. Ranjan, Post Doc. Scholar, Lincoln University, Malaysia
Midhun Chakkaravarthy, Faculty of Engineering, Lincoln University, Malaysia
Article Info
Volume 83
Page Number: 25246 - 25251
Publication Issue:
May - June 2020
Article History
Article Received: 11 May 2020
Revised: 19 May 2020
Accepted: 29 May 2020
Publication: 12 June 2020
1. Introduction
Text mining is the extraction of valuable,
previously hidden information from text
documents [2]. Text classification, also known
as text categorization, is one of the important
research issues in the field of text mining [6].
The exponential increase in the amount of data
available in digital form from different sources
makes it difficult to manage this huge amount of
online textual data [4]. It has therefore become
necessary to analyze these voluminous data by
applying classification/categorization
algorithms and to make sense of these large
texts (documents).
Text classification is a machine learning
technique which assigns a text document to one
or more of a set of predefined classes [3]. Text
classification can be carried out in two ways,
manually or automatically. In manual
classification, documents are categorized
according to the knowledge and interest of the
person who decides which facts are to be
considered important. Automatic (algorithmic)
text categorization schemes, on the other hand,
can categorize documents with better time
efficiency and improved classification
accuracy [7]. There are several methodologies
available for effective (automatic) text
categorization. The purpose of this section is to
identify, classify and discuss current research
work. There are several
distinct methods available for effective text
classification. Text classification is the process
of automatically sorting a set of available
documents into predefined categories [8].
Its applications also include automatic indexing,
filling patterns, selective dissemination, author
attribution, survey coding, essay grading,
etc. Text classification is broadly divided
into two categories, namely static data-based
learning and incremental data-based learning.
Among the various text classification
methodologies, incremental (dynamic)
learning has gained more interest because of its
significance and diversified applicability [7].
2.0 Text Classification Based on Incremental
Data-Based Learning
In this section we discuss different text
classification algorithms based on incremental
database learning. Machine learning
approaches such as Support Vector Machines,
Neural Networks, Random Forests, fuzzy
systems and probabilistic methods are explored.
2.1 Random Forest
Existing text classification research utilizing
the random forest (RF) algorithm is discussed
in this subsection. S. Thamarai Selvi
et al. [12] designed and developed a hybrid text
classification algorithm by integrating the
Rocchio algorithm and the RF algorithm for
multi-label text classification.
The Rocchio algorithm is a supervised learning
method which uses a set of vectors as its input
dataset and classifies them on the basis of cosine
similarity. A stop-word remover and a word
stemmer are used to overcome the limitations of
this algorithm. The RF algorithm is an
incremental database learning method based on
decision trees. It randomly selects data subsets
and features for training, and then assigns
the suitable category from the identified classes.
During text document classification, the input
vector is passed through all the trees in the
forest, and the class with the maximum votes is
taken as the output. The integrated method
overcomes the limitations that each algorithm
shows when implemented individually. It
has more potential than other available
methods such as Naïve Bayes,
Multi-label KNN (ML-KNN) and fuzzy
relevance clustering in terms of accuracy and
effectiveness.
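The two building blocks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the hybrid method of [12]: the class prototypes are plain sums of term counts (no Rocchio relevance weighting), the toy documents and the names `train_centroids`, `classify` and `majority_vote` are hypothetical, and the forest is reduced to a list of per-tree predictions.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_centroids(docs, labels):
    """Sum the term counts of each class into one prototype vector per class."""
    centroids = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        centroids[label].update(tokens)
    return centroids

def classify(tokens, centroids):
    """Rocchio-style decision: pick the class with the most similar prototype."""
    vec = Counter(tokens)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

def majority_vote(tree_predictions):
    """Random-forest-style decision: the class with the most tree votes wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Toy, already stemmed and stop-word-filtered documents (illustrative only).
docs = [["goal", "match", "team"], ["team", "score", "goal"],
        ["stock", "market", "price"], ["price", "trade", "market"]]
labels = ["sport", "sport", "finance", "finance"]

centroids = train_centroids(docs, labels)
prediction = classify(["team", "goal", "price"], centroids)
forest_output = majority_vote(["sport", "finance", "sport"])
```

Because cosine similarity ignores vector length, using summed counts instead of averaged counts for the prototypes does not change the decision.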
2.2 Neural Network
A neural network classifier is basically a
network of neurons. In the network, the input
layer contains the feature terms, the output layer
represents the predefined target categories, and
the connections between the neuron units are
represented by the assigned weights. The feature
vector of the text document is assigned to the
input layer neurons, an activation function
(sigmoid) is applied to these units, and the result
is forwarded through the network. The
values at the output layer determine the
category of the input text document.
Feed-forward computation and back-propagation
are typically used in training the neural
network [11].
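The forward computation described above can be sketched for a network with no hidden layer. The weights, feature terms and category names below are hypothetical values chosen for illustration; a trained network would learn the weights by back-propagation.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(features, weights, biases):
    """One output unit per category: sigmoid of the weighted feature sum."""
    return [sigmoid(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(weights, biases)]

categories = ["sport", "finance"]
# Hypothetical weights: one row per category, one column per feature term
# ("goal", "team", "price").
weights = [[1.5, 1.0, -2.0],
           [-1.0, -1.5, 2.5]]
biases = [0.0, 0.0]

features = [1.0, 1.0, 0.0]  # the document mentions "goal" and "team"
activations = forward(features, weights, biases)
predicted = categories[activations.index(max(activations))]
```

The category whose output unit has the largest activation is taken as the prediction.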
ZhiHang Chen et al. [13] presented
Incremental Learning of Text Classification
(ILTC), a framework which learns the features
of each text class and then applies an
incremental perceptron learning method. The
algorithm is capable of learning new
feature dimensions and new document classes
incrementally. ILTC can learn from new
training data arriving over time without
referring to, or forgetting, what it learned
previously. This learning method is
economical and feasible for both temporal and
spatial databases, as it does not require
pre-processing and storing the old instances.
The authors discuss two learning phases of the
method: first, the incremental learning of text
classification features, and second, the
incremental neural learning of the features and
newly introduced classes [5].
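The idea of a perceptron that accepts new feature dimensions and new classes without revisiting old documents can be sketched as below. This is a generic multi-class perceptron written for illustration, not the ILTC algorithm of [13]; the class name `IncrementalPerceptron`, the update rule and the toy stream are all assumptions.

```python
class IncrementalPerceptron:
    """Multi-class perceptron whose feature and class sets can grow over time."""

    def __init__(self):
        self.weights = {}  # label -> {term: weight}; unseen terms count as 0

    def score(self, tokens, label):
        w = self.weights[label]
        return sum(w.get(t, 0.0) for t in tokens)

    def predict(self, tokens):
        return max(self.weights, key=lambda c: self.score(tokens, c))

    def partial_fit(self, tokens, label):
        """Update on one new document; earlier documents are never revisited."""
        self.weights.setdefault(label, {})  # a new class gets an empty weight map
        rivals = [c for c in self.weights if c != label]
        rival = max(rivals, key=lambda c: self.score(tokens, c), default=None)
        # Update unless the true class already strictly beats every rival.
        if rival is None or self.score(tokens, label) <= self.score(tokens, rival):
            for t in tokens:  # new terms become new feature dimensions here
                self.weights[label][t] = self.weights[label].get(t, 0.0) + 1.0
                if rival is not None:
                    self.weights[rival][t] = self.weights[rival].get(t, 0.0) - 1.0

model = IncrementalPerceptron()
stream = [(["goal", "team"], "sport"),
          (["price", "market"], "finance"),
          (["vote", "election"], "politics"),  # class first seen mid-stream
          (["goal", "match"], "sport")]
for tokens, label in stream:
    model.partial_fit(tokens, label)
```

Storing weights in dictionaries keyed by term is what lets the feature space grow: a term that has never been seen simply contributes zero until an update touches it.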
Patrick Marques Ciarelli et al. [21] proposed an
incremental neural network based on the
evolving Probabilistic Neural Network (EPNN),
which addresses the multi-label issue in text
classification. EPNN is a neural network with a
compact architecture that supports an
incremental learning algorithm. Its training
phase needs only one iteration. EPNN is
capable of continuous learning with this
reduced architecture. It can always accept new
training data, even when the network
architecture does not grow. The method required
only a minimum number of weights and
parameters and the architecture was also
comparatively stable, as the total number of
weights did not grow in proportion to the total
number of training instances. The
computational cost remains constant even as
the number of training instances increases.
EPNN performed better than the other
existing methods considered for evaluation on
the five metrics designed for
multi-label problems in web pages. The
complexity of the network architecture is low in
comparison with other similar methods.
2.3 Using fuzzy system
In this subsection, existing work on fuzzy
systems for text document classification is
discussed. Nurfadhlina Mohd Sharef and
Trevor Martin [14] proposed an Evolving Fuzzy
Grammar (EFG) method for the classification of
crime-related documents using incremental
learning. The incremental learning method was
modelled on a fuzzy grammar which
is created by transforming a set of
chosen text fragments. The identified grammars
were integrated and used to match the learned
fuzzy grammar against the testing
dataset, and categorization was then carried out
on the basis of the degree of parsing membership.
The fuzzy notion was used because derivation,
parsing and grammar matching involve
uncertainty. The fuzzy union operator was used
for integrating and transforming the individual
text fragment grammars into more general
representations of the already learned text
fragments. The learned fuzzy grammar set
depended on the evolution of the existing
patterns. The results showed that the EFG
algorithm achieved results and performance
almost equivalent to those of machine learning
algorithms. The EFG can be easily integrated
into a more comprehensive grammar system; it
is highly interpretable and requires little time
for retraining and adaptation. The method was
efficient and effective compared with other
related methodologies for text categorization,
and it can be considered a potential technique
for machine learning, text segment expression
and representation.
2.4 Using probability theory
The Bayesian classifier, also known as the
Naïve Bayes classifier, is based on Bayes'
probability theorem with strong
independence assumptions. It estimates the
joint probability of a given document
belonging to a specific target class [9].
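A Naïve Bayes classifier lends itself to incremental learning because its model is a set of counts that can be updated one document at a time. The sketch below is a minimal multinomial variant with Laplace smoothing, written for illustration and not taken from [9]; the class name, the smoothing constant and the toy messages are assumptions.

```python
import math
from collections import Counter, defaultdict

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated document by document."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # documents seen per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocabulary = set()

    def partial_fit(self, tokens, label):
        """Absorb one labelled document without retraining on old ones."""
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def log_joint(self, tokens, label):
        """log P(class) + sum of log P(term | class), terms assumed independent."""
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_terms = sum(self.term_counts[label].values())
        denom = total_terms + self.alpha * len(self.vocabulary)
        for t in tokens:
            score += math.log((self.term_counts[label][t] + self.alpha) / denom)
        return score

    def predict(self, tokens):
        return max(self.doc_counts, key=lambda c: self.log_joint(tokens, c))

nb = IncrementalNaiveBayes()
for tokens, label in [(["free", "prize", "win"], "spam"),
                      (["meeting", "agenda", "notes"], "ham"),
                      (["win", "free", "offer"], "spam"),
                      (["project", "meeting", "notes"], "ham")]:
    nb.partial_fit(tokens, label)

first = nb.predict(["free", "win"])
nb.partial_fit(["budget", "meeting"], "ham")   # new data arrives later
second = nb.predict(["meeting", "budget"])
```

Working in log space avoids numerical underflow when many small term probabilities are multiplied.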
Renato M. Silva et al. [15] proposed a new
algorithm, Minimum Description Length Text
(MDLText), based on the minimum description
length principle for incremental text
categorization. MDLText is a lightweight, fast
multinomial text classifier which is highly
scalable and efficient and exhibits fast
incremental learning. A comprehensive grid
search was employed to select the best term
weighting method and parameters for every
dataset and method, ensuring a fair comparison.
MDLText was found to be robust, to avoid
overfitting and to have a low computational
cost. The method attained a good balance
between computational efficiency and
predictive power. Together, these features
make the method applicable to most online and
real-world text classification tasks on a large
scale. The scheme was efficient in terms of time
complexity, and the statistical evidence
indicated that it outperformed the other
equivalent methods.
Farhad Pourpanah [16] proposed a multi-agent
categorization system named Q-learning
Multi-Agent Classifier System (QMACS) for
addressing data classification problems. The
trust measurement is formulated by
integrating Bayesian formalism, belief
functions and Q-learning. Big O notation
is used for analyzing the time
complexity of the algorithm. The bootstrap
method with a confidence level of 95% is used
for the statistical analysis of performance. For
the negotiation between agents and agent teams
of QMACS, a "sealed-bid first-price auction"
method is applied. The predictions from the
learning agents were combined to enhance
the effectiveness and overall categorization
performance of QMACS. The results showed
that QMACS, by integrating predictions,
behaved differently from the other schemes
when tackling distinct benchmark problems.
This system had less time
complexity in the training phase than the other
methods because of the involvement of many
agents in the architecture.
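The bootstrap analysis mentioned above can be illustrated with a percentile confidence interval for a mean accuracy. This is a generic sketch, not the procedure of [16]: the accuracy values are made-up numbers, and the function name, resample count and seed are assumptions.

```python
import random

def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and record the mean of every resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-run accuracies of a classifier (not results from any paper).
accuracies = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.94, 0.90, 0.91]
low, high = bootstrap_ci(accuracies)
```

The interval `(low, high)` brackets the plausible range of the mean accuracy at the chosen confidence level without assuming a particular distribution for the scores.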
Renato M. Silva et al. [19] designed and
developed the Minimum Description Length Text
(MDLText) classifier, based on the minimum
description length principle, for filtering
undesired short text messages. The method
was integrated with incremental learning,
which made the predictive model scalable and
continuously adaptive to emerging
spamming methods. It provides fast
performance, with computational cost growing
linearly with the total number of features and
samples. The outcome shows that incremental
learning enhanced the performance of the text
classifier. The result statistics show that the
MDLText classifier performs better than
Online Gradient Descent (OGD), Stochastic
Gradient Descent (SGD), the Approximate Large
Margin Algorithm (ALMA), the perceptron and
the Relaxed Online Maximum Margin Algorithm
(ROMMA). The classifier was robust, avoided
overfitting during text categorization and had
minimum computational complexity. The
limitations of this method were high time
complexity and dependence on the dictionary
and the TF-IDF method.
2.5 Using support vector machine
The support vector machine (SVM) is a linear
classifier which is simple and effective, as it
uses both positive and negative
training data to construct the decision surface
(hyperplane) that best separates the positive
data from the negative data. This property is
uncommon among other classifiers
and makes SVM a unique classifier. The data
points closest to the decision surface are
called the support vectors.
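The roles of the hyperplane and the support vectors can be illustrated on a toy two-dimensional dataset. The weights below are set by hand to the maximum-margin separator for these four points (an assumption made for illustration); no SVM training is performed in this sketch.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance of point x from the hyperplane w.x + b = 0."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]
# Hand-picked separator x + y = 2, scaled so the closest points score +1 and -1.
w, b = (0.5, 0.5), -1.0

predictions = [1 if decision_value(w, b, x) >= 0 else -1 for x in points]
distances = [distance_to_hyperplane(w, b, x) for x in points]
margin = min(distances)
# The points lying exactly at the margin are the support vectors.
support_vectors = [x for x, d in zip(points, distances)
                   if math.isclose(d, margin)]
```

Only the points at the margin determine the hyperplane; moving the other two points farther away would leave `w` and `b` unchanged.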
Wenbo Guo et al. [17] proposed an innovative
active learning algorithm integrated with
incremental learning for solving text
categorization problems. SVM was used for
estimating the information within the sample
data, and it learned a linear
classifier in a typical kernel-based feature
space. The accuracy and optimal solution of the
SVM make it appropriate for active learning.
Active learning is a machine learning approach
in which the learner itself selects the data to
learn from, exploiting the distribution of the
dataset. The first step was spectral clustering,
which divided the dataset into two categories;
the classifier was then trained using the
labelled instances positioned at the category
boundary. Incremental learning was
integrated with active learning to
minimize the computational cost and
increase the classification accuracy. The
method was stable, with a minimal error rate,
and superior in accuracy and effectiveness
to other state-of-the-art active learning
schemes.
2.6 Using hybrid model
Fabio Rangel et al. [18] developed a
semi-supervised learning method based on the
Wilkie, Stonham & Aleksander's Recognition
Device (WiSARD) classifier (SSW) for text
categorization. Because class distributions
vary, semi-supervised learning was
well suited to text categorization
in social networks. The WiSARD classifier is
a weightless neural network consisting of
individual classifiers, named discriminators,
each assigned to learn the binary patterns of one
category. WiSARD acts as a one-shot
classifier and permits incremental learning
by adding training patterns to its content. It
performed text categorization using both
unlabeled and labelled data. The scheme was
fifty times faster than Expectation Maximization
Naive Bayes (EM-NB) and Semi-supervised
Support Vector Machine (S3VM), with
highly competitive accuracy. On all the tested
datasets, SSW showed a superior fitting
time and better standard deviation.
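The discriminator mechanism described above can be sketched as follows. This is a simplified supervised WiSARD-style model, not the semi-supervised SSW method of [18]; the class names, the tuple size and the toy bit patterns are assumptions, and the input length is assumed to be a multiple of the tuple size.

```python
import random

class Discriminator:
    """One class of the model: RAM nodes that remember the bit tuples they saw."""

    def __init__(self, mapping, tuple_size):
        self.mapping = mapping
        self.tuple_size = tuple_size
        self.rams = [set() for _ in range(len(mapping) // tuple_size)]

    def _addresses(self, bits):
        shuffled = [bits[i] for i in self.mapping]
        for r in range(len(self.rams)):
            start = r * self.tuple_size
            yield r, tuple(shuffled[start:start + self.tuple_size])

    def train(self, bits):
        """One-shot learning: write each addressed RAM location once."""
        for r, address in self._addresses(bits):
            self.rams[r].add(address)

    def response(self, bits):
        """Number of RAM nodes that recognise their part of the pattern."""
        return sum(address in self.rams[r] for r, address in self._addresses(bits))

class WisardSketch:
    def __init__(self, input_size, tuple_size, seed=0):
        self.mapping = list(range(input_size))
        random.Random(seed).shuffle(self.mapping)  # fixed pseudo-random wiring
        self.tuple_size = tuple_size
        self.discriminators = {}

    def train(self, bits, label):
        if label not in self.discriminators:  # new categories can appear any time
            self.discriminators[label] = Discriminator(self.mapping, self.tuple_size)
        self.discriminators[label].train(bits)

    def predict(self, bits):
        return max(self.discriminators,
                   key=lambda c: self.discriminators[c].response(bits))

net = WisardSketch(input_size=8, tuple_size=2)
net.train([1, 1, 1, 1, 0, 0, 0, 0], "A")
net.train([0, 0, 0, 0, 1, 1, 1, 1], "B")
noisy = net.predict([1, 1, 1, 0, 0, 0, 0, 0])   # pattern A with one bit flipped
net.train([1, 0, 1, 0, 1, 0, 1, 0], "C")        # added later, no retraining
late = net.predict([1, 0, 1, 0, 1, 0, 1, 0])
```

Training only ever adds entries to the RAM nodes, which is why new patterns and new categories can be absorbed without touching what was stored before.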
Jianqiang Li et al. [20] presented a hybrid
model, named Mixed Word Embedding (MWE),
based on the word2vec toolbox for text
classification. MWE integrates the two
word2vec variants, the continuous skip-gram
model (SKIP-GRAM) and the continuous
bag-of-words model (CBOW). The two share a
common encoding structure with the ability to
capture precise syntactic information about
words. Additionally, MWE includes a global
text vector with the CBOW variant for capturing
additional semantic information. The time
complexity of MWE was similar to the time
complexity of the SKIP-GRAM variant. The
model was studied from both the application and
the linguistic perspective for an effective
evaluation of the MWE scheme. For the
linguistic perspective, empirical studies were
conducted on word similarities and word
analogies. From the application point of view,
the learned latent representations were
evaluated on sentiment analysis and
document classification. The MWE
methodology was highly competitive with most
of the traditional models, such as GloVe,
SKIP-GRAM and CBOW. One limitation of the
MWE scheme was that the proximity and
ambiguity among words were not considered.
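The difference between the two word2vec variants mentioned above lies in how training examples are formed from a sentence. The sketch below only generates those examples and does not train embeddings or reproduce MWE; the function names, the window size and the sample sentence are assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: the centre word predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(context words, target) pairs: the neighbours jointly predict the centre."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((tuple(context), target))
    return pairs

sentence = ["text", "classification", "is", "useful"]
sg = skipgram_pairs(sentence)
cb = cbow_pairs(sentence)
```

SKIP-GRAM produces one example per (word, neighbour) pair, whereas CBOW produces one example per word with all of its neighbours pooled as the input.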
In short, among the various text classification
methodologies, the incremental (dynamic)
learning has gained more interest because of its
significance and diversified applicability [5].
3.0 Conclusions
This paper briefly describes existing work on
text document classification algorithms. Text
document classification evolved as a technique
for organizing, structuring and classifying
large corpora of unstructured text. A text
classifier assigns the test document to one or
more predefined target categories. The curse of
dimensionality and the broad semantic meaning
and multiple contexts of English words are
major challenges which need to be addressed.
Sparsity is another prevalent challenge for
text classification techniques. Moreover,
dimensionality remains high even after
stop-word removal and stemming. Because of
the high dimensionality, the time complexity
increases. The text classification methods
surveyed here enhance classification
accuracy and reduce training and
classification time by adopting different
strategies and optimization algorithms.
4.0 References
[1] K. S. Deepashri and Ashwini Kamath,
"Survey on Techniques of Data Mining and its
Applications," International Journal of Emerging
Research in Management & Technology, Vol.6,
No.2, February 2017.
[2] Ramzan Talib, Hanify, Ayeshaz et al., "Text
Mining: Techniques, Applications, and Issues,"
International Journal of Advanced Computer
Science and Applications (IJACSA), Vol.7, No.
11, 2016.
[3] R. Sagayam, "A Survey of Text Mining:
Retrieval, Extraction and Indexing Techniques,"
International Journal of Computational
Engineering Research, Vol.2, No.5, 2012.
[4] W. Berry Michael, “Automatic Discovery of
Similar Words: Survey of Text Mining:
Clustering, Classification and Retrieval”,
Springer, PP. 24-43, 2004.
[5] ZhiHang Chen, Liping Huang et al.,
"Incremental Learning for Text Document
Classification," In the Proceedings of
International Joint Conference on Neural
Networks, pp. 12-17, August 2007.
[6] Miji K Raju, Sneha T Subrahmanian et al. ,
"A Comparative Survey on Different Text
Categorization Techniques," International
Journal of Computer Science and Engineering
Communications, Vol.5, No.3, pp. 1612-1618,
2017.
[7] Yung-Shen Lin, Jung-Yi Jiang et al., "A
Similarity Measure for Text Classification and
Clustering," IEEE Transactions on Knowledge
and Data Engineering, Vol.26, No.7, July 2014.
[8] Said A. Salloum, Mostafa Al-Emran et al. "A
Survey of Text Mining in Social Media:
Facebook and Twitter Perspectives," Advances
in Science, Technology and Engineering
Systems Journal, Vol. 2, No. 1, pp. 127-133,
2017.
[9] B. Tang, H. He et al., "A Bayesian
Classification Approach using Class-Specific
Features for Text Categorization," in IEEE
Transactions on Knowledge and Data
Engineering, vol. 28, no. 6, pp. 1602-1606, June
1 2016.
[10] Chunting Zhou, Chonglin Sun et al., "A C-
LSTM Neural Network for Text Classification,"
Computation and Language (cs.CL), November
2015.
[11] Alexis Conneau, Yann Le Cun et al., "Very
Deep Convolutional Networks for Text
Classification," In the Proceedings of 15th
Conference of European Chapter of the
Association for Computational Linguistics:
Vol.1, pp. 1107-1116, April 3-7, 2017.
[12] S.Thamarai Selvi, P. Karthikeyan et al.,
"Text Categorization using Rocchio Algorithm
and Random Forest Algorithm," In the
Proceedings of IEEE Eighth International
Conference on Advanced Computing (ICoAC),
2016.
[13] ZhiHang Chen, Liping Huang et al.,
"Incremental Learning for Text Document
Classification," In the proceedings of
International Joint Conference on Neural
Networks, Orlando, Florida, USA, August 12-
17, 2007.
[14] Nurfadhlina Mohd Sharef and Trevor
Martin, "Evolving fuzzy grammar for crime text
categorization," Journal of Applied Soft
Computing, Vol. 28, pp. 175-187, March 2015.
[15] Renato M , Tiago A. et al., "MDLText: An
Efficient and Lightweight Text Classifier,"
Knowledge-based Systems, Vol. 118, pp. 152-
164, 15 February 2017.
[16] Farhad Pourpanah, Choo Jun Tan, Chee
Peng Lim et al., "A Q-learning-based Multi-
Agent System for Data Classification," Applied
Soft Computing, Vol.52, pp. 519-531, March
2017.
[17] Wenbo Guo, Chun Zhong and Yupu Yang,
"Spectral Clustering based Active Learning with
Applications to Text Classification," In the
Proceedings of 8th International Conference on
Computer and Automation Engineering, Vol. 56,
2016.
[18] Fabio Rangel, Fabricio Firmino et al.,
"Semi-Supervised Classification of Social
Textual Data Using WiSARD," In the
proceedings of European Symposium on
Artificial Neural Networks ESANN,
Computational Intelligence and Machine
Learning, 27-29 April 2016.
[19] Nihar M. Ranjan, Rajesh S. Prasad,
"Automatic Text Classification using BP-Lion
Neural Network and Semantic Word
Processing," Imaging Science Journal, Taylor &
Francis, ISSN: 1368-2199, Sept. 2017.
[20] Jianqiang Li, Jing Li et al., "Learning
Distributed Word Representation with Multi-
Contextual Mixed Embedding," Knowledge-
Based Systems, Vol. 106, pp. 220-230, August
2016.
[21] Nihar M. Ranjan, Rajesh S. Prasad,
"LFNN: Lion Fuzzy Neural Network based
evolutionary model for text classification using
context and sense based features," Applied Soft
Computing Journal, Elsevier, pp. 994-1008,
ISSN: 1568-4946, July 2018.
... Classification is of three types: rule-based, machine learning-based and mixed [4,5]. ...
Article
Full-text available
The object of the study is the process of transferring state public services into electronic form, which is associated with the need to transfer from a document-based service model to a data-based service model. When modeling state public services, a common data model for describing public services offered in administration is being used. This model is based on the use of "Core Vocabularies" that are necessary for the classification of data and entities related to this subject area. Thus, the article considers the actual task of analyzing documents to recognize data that should be classified using "Core Vocabularies". To solve the stated problem, an algorithm has been developed that allows recognizing the data contained in documents based on document analysis. The data set associated with the document formed as a result of the analysis is classified using "Core Vocabularies" at the second stage of the algorithm. When creating the algorithm, the results of the analysis of research in data recognition and classification were taken into account. The article discusses an illustrative example and presents the results of data classification for the "Core Person Vocabulary". The practical worth of the developed algorithm is that it is being used in the algorithmic software developed for an information system for recognizing and classifying data in documents, which makes it possible to transfer to a new model of data-based representation of public services. The use of an information system for recognizing and classifying data in documents is of high importance in the processes of reengineering state public services, creating new public services, and transferring public services to electronic form. This leads to drastic increase in the efficiency of the state public service system. Ref. 5, pic. 4, tabl. 1
... The threats can happen in diverse areas like IT organizations, government entities, educational institutions, financial services, Wikipedia, and online social domains. Online anonymity gives rise to crucial attacks and minimizes social restrictions and paved the way for many-to-many intervention [22,27]. The presence of unidentified users is everywhere in the environment and they are faster in their execution when compared with the genuine users [22,28]. ...
Article
Full-text available
The tremendous development and rapid evolution in computing advancements has urged a lot of organizations to expand their data as well as computational needs. Such type of services offers security concepts like confidentiality, integrity, and availability. Thus, a highly secured domain is the fundamental need of cloud environments. In addition, security breaches are also growing equally in the cloud because of the sophisticated services of the cloud, which cannot be mitigated efficiently through firewall rules and packet filtering methods. In order to mitigate the malicious attacks and to detect the malicious behavior with high detection accuracy, an effective strategy named Multiverse Fractional Calculus (MFC) based hybrid deep learning approach is proposed. Here, two network classifiers namely Hierarchical Attention Network (HAN) and Random Multimodel Deep Learning (RMDL) are employed to detect the presence of malicious behavior. The network classifier is trained by exploiting proposed MFC, which is an integration of multi-verse optimizer and fractional calculus. The proposed MFC-based hybrid deep learning approach has attained superior results with utmost testing sensitivity, accuracy, and specificity of 0.949, 0.939, and 0.947.
... It uses a Naive approach to constant backdrop for didactic purposes and highlights a problem for information storage and retrieval. This, leads to the use of LSTM architecture, and in comparison to other recurrent net algorithms, it learns to solve complex artificial tasks [12]. ...
Article
Full-text available
The exponential growth of unstructured data is one of the most critical challenges in data mining, text analytics, or data analytics. Around 80% of the world's data are available in unstructured format and most are left unattended due to the complexity of its analysis. It is a great challenge to guarantee the quality of the text document classifier that classifies documents based on user preferences because of large-scale terms and data patterns. The World Wide Web is growing rapidly and the availability of electronic documents is also increasing. Therefore, the automatic categorization of documents is the key factor for the systematic organization of information and knowledge discovery. Most existing widespread text mining and classification strategies have adopted term-based approaches. However, the problems of polysemy and synonymy in such approaches are of great concern. To classify documents based on their context, the context-based approach is needed to be followed. Semantic analysis of the text overcomes the limitations of the term-based approach and it also enhances the accuracy of the classifiers. This paper aims to highlight the important algorithms, techniques, and methodologies that can be used for text document classification. Furthermore, the paper also provides a review of the different stages of Text Document Classification.
... we compared these systems on various parameters. In the paper, Erwin Halim et al. [1] proposed a system that can combine the records among the different healthcare organizations. This system uses the famous Nonaka and Takeuchi model of knowledge conversion for integrating the medical information about the patient, which is distributed at a different organization, into a single but complete medical record about the patient. ...
Article
Full-text available
Nowadays digital data storage and digital communication are widely used in the healthcare sector. Since data in the digital form significantly easier to store, retrieve, manipulate, analyses, and manage. Also, digital data eliminate the threat of data loss considerably. These advantages pushing many hospitals to store their data digitally. But, as the patients reveal their private and important information to the doctor, it is very crucial to maintain the privacy, security, and reliability of the healthcare data. In this process of handling the data securely, several technologies are being used like cloud storage, data warehousing, blockchain, etc. The main aim of this survey is to study the different models and technologies in the healthcare sector and analyses them on different parameters like security, privacy, performance, etc. This study will help the new developing healthcare systems to choose appropriate technology and approach to build a more efficient, robust, secure, and reliable system.
... Then Gradient Boosting technique is applied which employs gradient descent algorithm to optimize the loss function. This algorithm is portable, supports all major programming languages and can be used to solve various types of problems [15]. ...
Article
Full-text available
As the world is becoming more digitalized with every sector using the internet to flourish their businesses, online transactions have become an inevitable part of life. There has been a steady rise in the number of online transactions and this will continue to increase in the future as well. One of the major modes of online transactions is credit cards and along with its extensive use comes its major drawback, that is, credit card fraud. Machine learning plays a vital role in detecting credit card frauds as it is not possible for banks to monitor every transaction. This paper explores different machine learning algorithms used to detect credit card frauds.
Article
Full-text available
Deep neural network recently created offers insight on how high-level picture representation may be automatically learned in raw pixels. Deep learning with Convolutional Neural Networks has showed significant potential in the classification and improvement of picture, but is frequently unfit for predictive modelling without spatial connections utilizing data. In order to organize high dimensional vectors in a compact way for CNN- based profound learning, we offer a method to representation. This study shows the use of deep neural networks to create a system that can identify different texture characteristics.
Article
Different mobile companies ship different PC suites. We propose a system that provides a common interface for different mobile devices. The Universal PC suite helps us connect all mobile devices through a common interface that provides the same functionality as a normal PC suite. Different mobile devices can be connected to the Universal PC suite, and users can access their contacts, messages, phone, and memory through the PC suite interface. This is done with the help of AT Commands. More than 90% of phone modems support AT Commands and are detected by our desktop with the help of HyperTerminal.
Article
Text to speech conversion is one of the applications of machine learning. It is widely used in search engines, standalone applications, web applications, chatbots, and Android applications. There is still a need to upgrade text to speech systems to obtain more interactive and user-friendly applications. A traditional text to speech application produces a monotonous voice that carries no emotion and sounds mechanical, so the existing system needs to be improved by embedding emotion in its output. Existing text to speech systems cannot be used in storytelling applications and do not provide effective communication. Most text to speech systems are developed using algorithms such as Support Vector Machine (SVM) and Naïve Bayes. An Emotion Based Text to Speech System will help to improve the existing text to speech systems. Machine learning and deep learning algorithms such as Recurrent Neural Networks can be used to perform sentiment analysis and semantic analysis on the input text. We are going to use a neural network, which is more effective and helps to maintain a relation between the previous word and the next word. The emotion based text to speech system will be able to identify four emotions: 'happy', 'sad', 'angry', and 'neutral'. It will be beneficial for educational purposes, such as storytelling applications for young children, and will be useful for visually impaired individuals.
Article
Text mining has become a major research topic in which text classification is an important task for finding the relevant information in a new document. Accordingly, this paper presents a semantic word processing technique for text categorization that utilizes semantic keywords instead of the independent features of the keywords in the documents, so that the dimensionality of the search space can be reduced. The Back Propagation Lion algorithm (BP Lion algorithm) is also proposed to overcome the problem in updating the neuron weights. The proposed text classification methodology is evaluated on two data sets, namely 20 Newsgroup and Reuter. The performance of the proposed BPLion is analysed in terms of sensitivity, specificity, and accuracy, and compared with the performance of existing works. The results show that the proposed BPLion algorithm and semantic processing methodology classify the documents with less training time and a higher classification accuracy of 90.9%.
Conference Paper
Text categorization is a problem which can be addressed by a semi-supervised learning classifier, since the annotation process is costly and ponderous. The semi-supervised approach is also adequate in the context of social network text categorization, due to its adaptation to class distribution changes. This article presents a novel approach for semi-supervised learning based on WiSARD classifier (SSW), and compares it to other already established mechanisms (S3VM and NB-EM), over three different datasets. The novel approach showed to be up to fifty times faster than S3VM and EM-NB with competitive accuracies.
Article
Text mining has become one of the trendy fields that has been incorporated in several research fields such as computational linguistics, Information Retrieval (IR), and data mining. Natural Language Processing (NLP) techniques are used to extract knowledge from text written by human beings. Text mining reads unstructured data to provide meaningful information patterns in the shortest possible time. Social networking sites are a great source of communication, as most people in today's world use these sites in their daily lives to stay connected to each other. It has become common practice not to write sentences with correct grammar and spelling. This practice may lead to different kinds of ambiguity, such as lexical, syntactic, and semantic ambiguity, and with this type of unclear data it is hard to find out the actual data order. Accordingly, we conduct an investigation into different text mining methods for obtaining various textual orders on social media websites. This survey aims to describe how studies in social media have used text analytics and text mining techniques for the purpose of identifying the key themes in the data. The survey focuses on analyzing the text mining studies related to Facebook and Twitter, the two dominant social media in the world. The results of this survey can serve as baselines for future text mining research.
Article
In many areas, the volume of text information is increasing rapidly, thereby demanding efficient text classification approaches. Several methods are available at present, but most exhibit declining performance as the dimensionality of the problem increases, or they incur high computational costs for training, which limit their application in real scenarios. Thus, it is necessary to develop a method that can process high dimensional data in a rapid manner. In this study, we propose the MDLText, an efficient, lightweight, scalable, and fast multinomial text classifier, which is based on the minimum description length principle. MDLText exhibits fast incremental learning as well as being sufficiently robust to prevent overfitting, which are desirable features in real-world applications, large-scale problems, and online scenarios. Our experiments were carefully designed to ensure that we obtained statistically sound results, which demonstrated that the proposed approach achieves a good balance between predictive power and computational efficiency.
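The idea of incremental learning mentioned in this abstract can be illustrated with a small sketch. Note that this is not MDLText and does not use the minimum description length principle; it swaps in a plain multinomial naive Bayes classifier, chosen only because its per-class counts can be updated one document at a time without retraining. The class name `IncrementalNB`, the method names, whitespace tokenization, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class IncrementalNB:
    """Multinomial naive Bayes whose counts are updated one document at a time.

    Hypothetical illustration of incremental learning; not the MDLText method.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.doc_counts = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.total_words = Counter()             # total tokens per class
        self.vocab = set()                       # all tokens seen so far

    def partial_fit(self, text, label):
        """Absorb one labelled document without revisiting earlier ones."""
        tokens = text.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.total_words[label] += len(tokens)
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the class with the highest smoothed log-posterior."""
        tokens = text.lower().split()
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n in self.doc_counts.items():
            score = math.log(n / n_docs)  # log prior
            denom = self.total_words[label] + self.alpha * vocab_size
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stream of documents arriving over time (made-up data).
clf = IncrementalNB()
clf.partial_fit("stock market shares fall", "finance")
clf.partial_fit("team wins the football match", "sport")
print(clf.predict("football team"))

# A new document arrives later; only the counts are updated.
clf.partial_fit("bank raises interest rates", "finance")
print(clf.predict("interest rates and shares"))
```

Because the model state is only a set of counters, each new document costs time proportional to its length, which is the property that makes this kind of classifier a convenient baseline for dynamic databases.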
Article
Text classification is one of the popular techniques of text mining that labels documents based on a set of topics defined according to the requirements. Among the various approaches used for text categorization, incremental learning techniques are important due to their widespread applications. This paper presents a connectionist classification approach using context-semantic features and an LFNN-based incremental learning algorithm for text classification. The proposed technique considers a dynamic database for the classification so that the classifier can learn the model dynamically. This incremental learning process adopts the Back Propagation Lion (BPLion) Neural Network, which includes fuzzy bounding and the Lion Algorithm (LA) for the feasible selection of weights. The effectiveness of the proposed method is analyzed by comparing it with the existing techniques I-BP, FI-BP, and I-BPLion in terms of accuracy and error. The comparison shows that LFNN attains classification accuracies of 81.49%, 83.41%, 88.76%, and 95%, and minimum error values of 8.11, 7.49, 3.02, and 4.92, for the 20 Newsgroup, Reuter, WebKB, and RCV1 datasets respectively.
Book
As the volume of digitized textual information continues to grow, so does the critical need for designing robust and scalable indexing and search strategies/software to meet a variety of user needs. Knowledge extraction or creation from text requires systematic, yet reliable processing that can be codified and adapted for changing needs and environments. Survey of Text Mining is a comprehensive edited survey organized into three parts: Clustering and Classification; Information Extraction and Retrieval; and Trend Detection. Many of the chapters stress the practical application of software and algorithms for current and future needs in text mining. Authors from industry provide their perspectives on current approaches for large-scale text mining and obstacles that will guide R&D activity in this area for the next decade. Topics and features:
* Highlights issues such as scalability, robustness, and software tools
* Brings together recent research and techniques from academia and industry
* Examines algorithmic advances in discriminant analysis, spectral clustering, trend detection, and synonym extraction
* Includes case studies in mining Web and customer-support logs for hot-topic extraction and query characterizations
* Extensive bibliography of all references, including websites
This useful survey volume taps the expertise of academicians and industry professionals to recommend practical approaches to purifying, indexing, and mining textual information. Researchers, practitioners, and professionals involved in information retrieval, computational statistics, and data mining, who need the latest text-mining methods and algorithms, will find the book an indispensable resource.
Conference Paper
The dominant approaches for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which have pushed the state-of-the-art in computer vision. We present a new architecture (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations. We are able to show that the performance of this model increases with the depth: using up to 29 convolutional layers, we report improvements over the state-of-the-art on several public text classification tasks. To the best of our knowledge, this is the first time that very deep convolutional nets have been applied to text processing.