Engineering, Technology & Applied Science Research, Vol. 8, No. 1, 2018, pp. 2590-2594. Shaikh: Keyword Detection Techniques: A Comprehensive Study
Keyword Detection Techniques
A Comprehensive Study
Zaffar Ahmed Shaikh
Faculty of Computer Science & Information Technology
Benazir Bhutto Shaheed University
Lyari, Karachi, Pakistan
Abstract—Automatic identification of influential segments from a
large amount of data is an important part of topic detection and
tracking (TDT). This can be done using keyword identification
via collocation techniques, word co-occurrence networks, topic
modeling and other machine learning techniques. This paper
reviews existing traditional keyword extraction techniques and
analyzes them to draw useful insights and to give future
directions for better automatic, unsupervised and language
independent research. The paper reviews extant literature on
existing traditional TDT approaches for automatic identification
of influential segments from large amounts of data in the keyword
detection task. The current keyword detection techniques used by
researchers have been discussed. Inferences have been drawn
from current keyword detection techniques used by researchers,
their advantages and disadvantages over the previous studies and
the analysis results have been provided in tabular form. Although
keyword detection has been widely explored, there is still
substantial scope and need for identifying topics from uncertain
user-generated data.
Keywords: keyword detection; information retrieval; topic detection; machine learning; comprehensive study
Keyword extraction using manual methods is slow, expensive
and error-prone [1]. In recent years,
many automatic bursty keyword extraction techniques have
been proposed to extract keywords from large amounts of data.
These keywords are helpful in identifying themes and
influential segments and framing semantic web and other
applications of natural language processing [2, 3]. Automatic
keyword detection research area is related to topic detection
and tracking (TDT) domain which was proposed in [4].
Various applications use keyword extraction techniques for
web search, report generation and cataloguing [5]. This area is
intended to identify the most useful terms in a document, a task
that involves several sub-processes. Input documents arrive in
formats such as MS Word, HTML, or PDF. Initially, the
documents are pre-processed to
remove redundant and unimportant information [6, 7]. The data
is then processed through different keyword extraction
approaches including statistical approach, linguistic approach,
machine learning approach, network based approach and topic
modelling approach [8, 9].
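The pre-processing step described above can be sketched in a few lines. This is a minimal illustration, not the pipeline of any particular reviewed system; the tiny stop-word list is an assumption made for brevity, and real systems use much larger lists and language-specific tokenizers.

```python
import re

# Toy stop-word list for illustration; production systems use larger lists.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "for"}

def preprocess(document: str) -> list[str]:
    """Lowercase, tokenize, and drop stop words from raw text."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The extraction of keywords is an important task in NLP"))
# -> ['extraction', 'keywords', 'important', 'task', 'nlp']
```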
In statistical approach, term frequency–inverse document
frequency (Tf-Idf) is the most widely used technique for
keyword extraction. Tf-Idf assigns a document a score with
respect to some query; the score changes whenever the query is
changed or updated, and without a query there is no score [10].
Recently, many new techniques have
been developed for statistical keyword extraction [11]. These
include PageRank, LexRank, etc. PageRank assigns a score to a
document based upon the documents it links to and the
documents which link to it. It is a global ranking scheme [10];
therefore, unlike Tf-Idf, the PageRank score does not change
with the query used. As
observed, PageRank and LexRank algorithms perform better
than Tf-Idf. In the linguistic approach, keywords are identified
automatically through measures of semantic resemblance [12].
In the machine learning approach, keyword extraction is treated
as a classification task [13].
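The Tf-Idf scoring discussed above can be sketched as follows. This is a minimal unsmoothed variant on a toy corpus, an assumption for illustration; the reviewed systems differ in their exact weighting and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by term frequency times
    inverse document frequency (no smoothing, for clarity)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = [["keyword", "extraction", "keyword"],
        ["topic", "detection", "keyword"],
        ["topic", "modelling"]]
scores = tf_idf(docs)
# "extraction" occurs in one document of three, so it outscores
# "keyword", which occurs in two, despite its lower raw count.
```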
Different dictionaries including WordNet, SentiNet and
ConceptNet are used for keyword extraction techniques. In
network based algorithms, the nature and semantics of word
co-occurrence networks is studied to identify important terms.
Here, nodes represent words and edges are weighted by
co-occurrence frequency [14]. Many useful insights have
been obtained from these algorithms for identifying influential
segments and keywords. Topic modelling techniques have been
popularized in [15], where the Latent Dirichlet Allocation
(LDA) technique was introduced to identify which document
relates to which topic and to what extent [16]. This has been
further improved by Hierarchical Dirichlet Process, Pachinko
Allocation Model, Relational Topic Modeling, Conditional
Topic Random Fields and recently by Hierarchical Pitman–
Yor–Dirichlet Language Model and Graph Topic Model [17].
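The generative story behind Latent Dirichlet Allocation [15] can be stated compactly: each document d draws a topic mixture, each word position n is assigned a topic, and the word is drawn from that topic's word distribution.

```latex
\theta_d \sim \operatorname{Dirichlet}(\alpha), \qquad
\phi_k \sim \operatorname{Dirichlet}(\beta), \qquad
z_{d,n} \sim \operatorname{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \operatorname{Multinomial}(\phi_{z_{d,n}})
```

Here \(\theta_d\) is the topic mixture of document d, \(\phi_k\) the word distribution of topic k, and \(\alpha, \beta\) the Dirichlet hyperparameters.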
Although keyword extraction is an important area of research
that has received much attention from researchers and
practitioners, no state-of-the-art keyword extraction method
has emerged comparable to those for many other core natural
language processing tasks [18]. This paper reviews existing traditional
keyword extraction techniques and analyzes them to draw
useful insights and to give future directions for better automatic,
unsupervised and language-independent research.
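Before turning to individual studies, the word co-occurrence network idea used by several of the reviewed approaches can be sketched. The window size and toy sentences below are illustrative assumptions; ranking here is by weighted degree, the simplest of the centrality measures the reviewed papers employ.

```python
from collections import defaultdict

def cooccurrence_graph(sentences, window=2):
    """Build a word co-occurrence network: nodes are words, edge
    weights count co-occurrences within a sliding window."""
    weights = defaultdict(int)
    for words in sentences:
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                if v != w:
                    weights[frozenset((w, v))] += 1
    return weights

def rank_by_degree(weights):
    """Rank words by the summed weight of their incident edges."""
    degree = defaultdict(int)
    for edge, w in weights.items():
        for word in edge:
            degree[word] += w
    return sorted(degree, key=degree.get, reverse=True)

sentences = [["topic", "detection", "tracking"],
             ["topic", "modelling"],
             ["keyword", "topic"]]
ranked = rank_by_degree(cooccurrence_graph(sentences))
# "topic" co-occurs with the most words, so it ranks first.
```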
Authors in [19] developed a tool called ‘Keyword
Extractor’ for automatic extraction of the terms that most
closely match experts’ preferences. Their study was related to
brain research which involved worldwide collaborations and
exchange of information among neuroinformatics centers and
portal sites. The main objective of their study was the efficient
use of resources and the improvement in the quality of brain
research. Each center and site developed their own set of
keywords for classification of the main text and the resources.
The researchers tested their tool on the abstract database of
two science journals. Authors in [20] extracted keywords from
a Chinese microblog. To extract keywords, they performed five
steps and used three features (i.e., graph model, semantic space,
and location of words). In the first step, the researchers
retrieved a user's posts through the microblog API. Secondly,
they preprocessed the data by applying data cleaning, word
segmentation, POS tagging, and stop-word removal techniques. To
extract keywords, researchers in the third step created a graph
model that was based on the co-occurrence between words.
They assigned sequence numbers to the words according to
their location and computed word weights using a score
formula. In the fourth step, researchers first created a
semantic space that was based on topic extraction and then
computed statistical weight of the words by using Tf-Idf. In the
fifth and last step, researchers first identified location of words,
and then, based on the location of those words, computed the
rank value of each word. Authors in [21] took a structure-based
approach: they created a graph model and identified bursty
topics and events. In topic clustering, Twitter tweets were
separated to produce homogeneous and heterogeneous graphs. For
homogeneous graphs, researchers used the OSLOM algorithm
to find interactions among users. For heterogeneous graphs, the
RankClus algorithm was used to construct a numerically ranked
set of tweets. Finally, from both graph results, the concept,
theme or event of a tweet was determined by joining tweets
with the same name. The researchers planned to develop graph
models for different types of events and to construct
a method that can define events.
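The graph ranking that recurs in these studies, and in the PageRank discussion above, can be sketched with plain power iteration. The three-node graph and the parameter choices below are illustrative assumptions, not the configuration of any reviewed system.

```python
def pagerank(links, damping=0.85, iters=50):
    """Plain power-iteration PageRank over an adjacency dict
    mapping each node to the list of nodes it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            targets = outs if outs else nodes  # dangling node: spread evenly
            share = damping * rank[n] / len(targets)
            for m in targets:
                new[m] += share
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
# "a" is linked to by both "b" and "c", so it receives the highest rank.
```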
Authors in [22] developed a keyword extraction technique
for tweets with high variance and lexical variant problems.
Lexical variants are examples of free variation in language.
They are characterized by similarity in phonetical or spelling
form and identity of both meaning and distribution. The
authors used brown clustering and continuous word vector
methods. In the Brown clustering method, they clustered words
having the same meaning (such as no, noo, etc.) and then
derived features for each individual cluster. In the continuous word
vector method, the authors defined a layer by finding its
probability, and each word was then converted into a continuous
word vector. Next, they predicted the length of the keyword by
calculating the ratio between the number of keywords and the
total number of words in the tweets. In the end, linear
regression method was used to predict the number of
keywords. Authors in [23] developed a system to detect
popular keyword trend and bursty keywords. Their system
detects keyword abbreviations and any typing and spacing
errors. Their first step was to collect the candidate keywords
(a word starting with a capital letter or enclosed in quotation
marks is considered a candidate keyword). The second step
was to merge keywords. To do so,
they considered acronyms and typo and spacing errors, and
then, found out the Tf accordingly. Finally, they detected
popular keywords from the candidate keywords which were
merged, and then, selected bursty keywords using the burst
ratio technique. Authors in [24] introduced TOPOL (a topic
detection method based on topological data analysis) which
separates irrelevant noisy data from useful data. The
first step was preprocessing, in which hashtags, URLs, and
non-textual symbols were removed from tweets. The second
step was mapping, in which a matrix was generated by applying
the SVD technique. In the third step, which the authors called
topic extraction, topics were selected based on interest.
Finally, the results were computed based on topic
recall, keyword precision, and keyword recall parameters.
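The burst-ratio step described for [23] is not given as a formula in the source; one common formulation, used here as an assumption for illustration, compares a keyword's count in the current time window with its smoothed historical average.

```python
def burst_ratio(current_count, history_counts, smoothing=1.0):
    """Ratio of a keyword's count in the current window to its
    smoothed historical average; values well above 1 flag a burst."""
    average = (sum(history_counts) + smoothing) / (len(history_counts) + 1)
    return current_count / average

# A keyword averaging ~3 mentions per window that jumps to 30 bursts:
print(burst_ratio(30, [3, 2, 4, 3]))   # -> about 11.5
```

A threshold on this ratio (another free choice) then separates bursty keywords from merely popular ones.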
Authors in [25] presented and discussed different methods and
approaches used in the keyword extraction task. They also
proposed a graph based keyword extraction method for the
Croatian language which is based on the extraction of nodes.
The authors used a selectivity-based keyword extraction method
in which text is represented as vertices and edges. Results are
computed from in-degree, out-degree, closeness and
selectivity. Authors in [26] developed a keyword extraction
method that represents text with a graph, applies the centrality
measure and finds the relevant vertices. Authors proposed a
three-step based technique called TKG (Twitter Keyword
Graph). The first step was the pre-processing in which stop
words were removed. In the second step, a graph was
developed in which nearest neighbor and all neighbors were
considered. Finally, the results were computed based on the
precision, recall, F-measure test scores and graph scalability.
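The precision, recall and F-measure scores used for evaluation here, and in most of the reviewed studies, follow the standard definitions over true positives (TP), false positives (FP) and false negatives (FN):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```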
Authors in [27] proposed an information summarization
method for the large quantity of information disseminated
every day through tweets. Their method collects
tweets using a specific keyword and then, summarizes them to
find out the topics. The authors provide two algorithms: Topic
extraction using AGF (TDA) and topic clustering and tweet
retrieval (TCTR). The methodology first extracts tweets from
Twitter and then applies the Tf-Idf technique to find out weights
and word frequency. The AGF is evaluated using keyword
rating. Finally, the results are calculated based on the class
entropy, purity, and cluster entropy. Authors in [28] proposed a
technique in which a user can search using a search engine but
without entering any keywords. The Google similarity distance
technique is used to find the keywords. A log is maintained in
which user behavior and the repository are saved, so the need
for a separate repository is eliminated and everything is done online and in
real time. Keyword expansion and extraction methods are used
to extract relevant and accurate information. In keyword
expansion, the user is helped to enter the exact keyword and
obtain exact information. In keyword extraction, words are
analyzed based on occurrence, length and frequency. The
keyword extraction method relies on statistical
approaches and machine learning approaches. The proposed
methodology of the authors is composed of three parts: 1-g
filtering, Google similarity distance calculation, and search
results filtering. Finally, the results are calculated based on the
parameters of precision and recall. The relationship between
top k results is evaluated. Thus, the authors proposed a system
in which user just needs to browse the web page and the
relevant keywords are generated. The system suits the science
stream well, where terms are unambiguous, but may be less
accurate for the social sciences.
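The Google similarity distance used in [28] is commonly given as the normalized Google distance (NGD); here f(x) is the number of pages containing term x, f(x, y) the number containing both, and N the total number of indexed pages:

```latex
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x), \log f(y)\}}
```

Terms that co-occur on many of the same pages get a small distance, which is what lets the system surface related keywords without an explicit query.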
Authors in [29] proposed a facility based on a Bayesian text
classification approach, called high relevance keyword
extraction (HRKE), which extracts keywords at the
classification stage without a separate pre-classification
process. The facility uses posterior probability values to extract keywords.
The HRKE first extracts the words from the text. Next, the
posterior probability is calculated. Finally, the Tf-Idf method is
used to assign weights to words. Authors claim that the HRKE
facility improves the performance and accuracy of the Bayesian
classifier and reduces time consumption. Experiments were
conducted on three featured-article datasets and, in the end,
the corresponding threshold and accuracy graph was plotted.
Authors in [30] address the problem of part-of-speech (POS)
tagging for the informal text of Twitter. The authors first
developed a POS tagset. Secondly, they performed manual
tagging on the dataset. Afterwards, features for the POS tagger
were developed. Finally, experiments were conducted to
produce an annotated dataset for the research community.
Hashtags, URLs, and emoticons were considered, and results
were obtained with 90 percent accuracy. Authors concluded
claiming that the approach can be applied to linguistic analysis
of social media, and the annotated data can be used in
semi-supervised learning. Authors in [31] gave a solution to the
problem of statistical keyword extraction from the text by
adapting entropic and clustering approaches. Authors made
changes in these approaches and proposed a new technique
which detects keywords as per user’s needs. The main
objective of the authors was to find and rank important words
in the text. The two approaches were applied on short texts
(such as web pages, articles, glossary terms, generic short text
etc.) and long texts (such as books, periodicals etc.). Results
were evaluated and the clustering approach proved to be better
for both cases, while the entropic approach suited well for the
long text and did not perform well for the partitioned text.
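The entropic approach of [31] exploits the fact that content words cluster inside a text while function words spread uniformly. The sketch below is a simplified stand-in for the paper's intersymbol-distance statistics: it partitions the text into equal parts (the four-part split is an illustrative choice) and computes the Shannon entropy of each word's distribution across parts.

```python
import math
from collections import Counter

def positional_entropy(tokens, parts=4):
    """Shannon entropy of each word's distribution over equal parts
    of the text; content words cluster, so their entropy is lower
    than that of evenly spread function words."""
    size = max(1, len(tokens) // parts)
    spread = {}
    for i, tok in enumerate(tokens):
        part = min(i // size, parts - 1)
        spread.setdefault(tok, Counter())[part] += 1
    entropy = {}
    for tok, dist in spread.items():
        total = sum(dist.values())
        entropy[tok] = -sum((c / total) * math.log2(c / total)
                            for c in dist.values())
    return entropy

tokens = ["gene", "gene", "the", "a", "the", "b", "the", "c"]
ent = positional_entropy(tokens)
# "gene" is confined to one part (entropy 0); "the" spreads over three.
```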
Authors in [32] proposed a metric called entropy difference
(ED) for the ranking of the words on a Chinese dataset.
The metric is based on Shannon entropy and takes the
difference between intrinsic and extrinsic modes; the underlying
idea is that meaningful words are grouped together.
Therefore, the words are extracted and
ranked according to the entropy difference. Authors calculated
mean, mode and median on entropy differences. Their ED
metric proved to be a good choice in word ranking. The
method differentiates between the words that define authors’
purpose and the irrelevant words which are present randomly in
the text. This method is well suited for a single document
about which no information is known in advance. Authors in
[33] proposed HybridSeg, a solution to the inherently noisy and
short nature of tweets in Twitter streams. The authors
incorporated local context knowledge of the tweets with global
knowledge bases for better tweet segmentation. The tweet
segmentation process was performed on two tweet datasets.
The tweets were split into segments to extract the meaning of
the information conveyed through each tweet. Results show that
HybridSeg significantly improved tweet segmentation quality
compared with other traditional approaches. Authors claim that
the segment based entity is better than word based entity.
The author in [34] provided a unique solution to the keyword
extraction problem called ConceptExtractor.
ConceptExtractor does not decide on the relevance of a term
during the extraction phase; instead, it only extracts generic
concepts from texts and postpones the decision about relevant
terms to the needs of the downstream applications.
Authors claim that unlike other statistical extractors,
ConceptExtractor can identify single-word and multi-word
expressions using the same methodology. Results were
evaluated on three languages using precision and recall. The
authors also defined a specificity metric for both single- and
multi-word expressions that is usable in
other languages. Authors in [35] considered various Chinese
keyword extraction methods. An extended Tf approach is
defined which combines Chinese language characteristics with
the Tf method. The authors also developed a
classification model based on support vector machine (SVM)
algorithm. Many improvement strategies were defined and four
experiments were performed to evaluate the results. Results
showed that SVM optimized the keyword set, and precision and
recall rates improved markedly. The authors concluded that the
improved Tf method is much better than the traditional Tf
method in terms of accuracy and precision. Authors in [36]
discovered and classified terms that are either document titles
or ‘title-like’. Their idea was that title and title-like terms
should behave in the same way within a document. The
classifier was trained using distributional and linguistic features
to find the behavior of the terms. Different features were
considered such as location, frequency, document size etc. The
rating was calculated on the basis of topical, thematic and title
terms. After this the evaluation was performed based on recall
and precision. The recall rate for finding title terms was high
but precision was low, because some words which were not
titles were also identified as title terms. Authors
in [37] developed a sensitive text analysis method for extracting
task-oriented information from unstructured biological text
sources using a combination of natural language processing,
dynamic programming techniques and text classification methods. Using
computable functions, the model finds out matching sequences,
identifies effects of various factors and handles complex
information sequences. The authors pre-processed the text
contents and applied an entity-tagging component to them to
find the causes of diseases related to low-quality food. Results show
that bottom-up scanning of key-value pairs improves content
finding, which can be used to generate sequences relevant to
the testing task. The method improves
information retrieval accuracy in biological text analysis and
reporting applications.
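The posterior-probability idea behind HRKE [29] can be sketched as follows. The two-document toy corpus and the Laplace smoothing are illustrative assumptions, not the paper's exact procedure: words whose posterior is strongly skewed toward one class are treated as high-relevance keywords for that class.

```python
from collections import Counter

def class_posteriors(labeled_docs):
    """Compute P(class | word) for every word via Bayes' rule with
    Laplace smoothing over a list of (tokens, label) pairs."""
    classes = sorted({label for _, label in labeled_docs})
    vocab = {w for doc, _ in labeled_docs for w in doc}
    word_counts = {c: Counter() for c in classes}
    priors = Counter(label for _, label in labeled_docs)
    for doc, label in labeled_docs:
        word_counts[label].update(doc)
    posteriors = {}
    for w in vocab:
        joint = {}
        for c in classes:
            likelihood = ((word_counts[c][w] + 1)
                          / (sum(word_counts[c].values()) + len(vocab)))
            joint[c] = likelihood * priors[c] / len(labeled_docs)
        z = sum(joint.values())
        posteriors[w] = {c: p / z for c, p in joint.items()}
    return posteriors

posts = class_posteriors([(["goal", "match"], "sport"),
                          (["vote", "party"], "politics")])
# "goal" is seen only in the sport document, so its posterior
# favours "sport".
```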
Table I provides inferences drawn from modern keyword
detection techniques, their advantages and disadvantages over
previous studies, and result analysis.
Paper | Techniques used | Advantages | Disadvantages | Results / Analysis
[20] | a) Graph model; b) Semantic space | a) Can detect words which are wrongly segmented; b) Extracts keywords from a microblog | a) Not suitable for large texts; b) Some terms will not be extracted | Best performance obtained is
[21] | a) OSLOM algorithm; b) PageRank algorithm | a) Able to identify topics on Twitter; b) Less expensive | Not able to identify the events based on graph clusters | Best result obtained from the structure-based approach
[22] | a) Brown clustering; b) Continuous word vectors | a) Improved state of the art for keyword extraction; b) Automatic keyword extraction | Not suitable for Facebook text keyword extraction | Precision obtained is 72.05, recall 75.16
[24] | TOPOL | a) Suitable for noisy data; b) Reduces computation time and improves topic extraction results | Suffers from data fragmentation | Recall 0.5380, precision 0.7500
[26] | a) Tf-Idf; b) KEA; c) Proposed TKG | a) TKG proved robust and superior compared to other approaches; b) TKG is simpler to use than KEA | The best configuration of TKG was not found | TKG results better compared to KEA and Tf-Idf
[28] | a) Statistical approach; b) Machine learning | a) Search engine which can automatically extract important keywords; b) System works well | Not suitable for the business management domain | High recall rate
[29] | Bayesian approach | a) Low cost, simple and efficient; b) Handles raw data without pre-classification | a) Presence of noisy data may degrade performance; b) Feature selection method degrades the efficiency of the classification task | Improved accuracy
[31] | a) Entropic approach; b) Clustering approach | a) Suitable for both long and short texts; b) Reliable results | Median and mode did not give the correct result | Good clustering results for both short and long texts
[32] | Shannon entropy | a) Suitable for text with no information known in advance; b) Easy to implement numerically | Median and mode did not give the correct result | Better results for single documents
[33] | Hybrid segmentation | High-quality tweet segmentation | Manual segmentation is expensive | Improved precision
[34] | Statistical, language independent | Good for extracting single- and multi-word expressions | Not suitable for long text | Improved precision and recall
[36] | a) Decision tree; b) Pattern recognition | Easy title determination | a) Not easy to determine the best document size; b) Precision was not significant compared with recall | Recall of 85% was achieved for title-like terms
[37] | a) Sensitive text analysis; b) Context-based extraction method | a) Category-oriented approach for extraction of task-specific information; b) Investigations into recall and precision were carried out | Not tested on generic data | a) Food safety is analyzed to prevent future consequences; b) Improved classification accuracy by utilizing optimization constraints; c) Causes of diseases related to low-quality food were identified
This paper extends understanding of widely used existing
approaches to keyword detection in the identification of
influential segments from a large amount of textual data or
documents. Therefore, extant literature on existing traditional
TDT approaches to automatic identification of important words
was reviewed and discussed. Techniques reviewed include
collocation, word co-occurrence networks, topic modelling and
other machine learning approaches. Results show that the
majority of these techniques are domain and language
dependent. It was observed that although traditional keyword
extraction techniques have been performing satisfactorily, a
need exists to propose unsupervised, domain independent and
language independent techniques which use statistically
computational methods. The keyword extraction task has been
widely explored, but there is still substantial scope and need
for identifying topics from uncertain user-generated data.
[1] E. Landhuis, “Neuroscience: Big brain, big data”, Nature, Vol. 541, No.
7638, pp. 559-561, 2017
[2] G. Ercan, I. Cicekli, “Using lexical chains for keyword extraction”,
Information Processing & Management, Vol. 43, No. 6, pp. 1705-1714,
[3] R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L. M. Patnaik, “Feature
extraction and duplicate detection for text mining: A survey”, Global
Journal of Computer Science and Technology, Vol. 16, No. 5, pp. 1-20,
[4] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, Y. Yang, Topic
detection and tracking pilot study final report, DARPA Broadcast News
Transcription and Understanding Workshop, 1998
[5] P. Eckersley, G. F. Egan, S. Amari, F. Beltrame, R. Bennett, J. G.
Bjaalie, T. Dalkara, E. De Schutter, C. Gonzalez, S. Grillner, A. Herz, K.
P. Hoffmann, I. P. Jaaskelainen, S. H. Koslow, S.-Y. Lee, L.
Matthiessen, P. L. Miller, F. M. da Silva, M. Novak, V. Ravindranath, R.
Ritz, U. Ruotsalainen, S. Subramaniam, A. W. Toga, S. Usui, J. van Pelt,
P. Verschure, D. Willshaw, A. Wrobel, Tang Yiyuan, “Neuroscience
data and tool sharing”, Neuroinformatics, Vol. 1, No. 2, pp. 149-165,
[6] D. Kuttiyapillai, R. Rajeswari, “Insight into information extraction
method using natural language processing technique”, International
Journal of Computer Science and Mobile Applications, Vol. 1, No. 5,
pp. 97-109, 2013
[7] S. Rose, D. Engel, N. Cramer, W. Cowley, Automatic keyword
extraction from individual documents, Text Mining: Applications and
Theory, John Wiley & Sons, 2010
[8] J. Wu, S. R. Choudhury, A. Chiatti, C. Liang, C. L. Giles, “HESDK: A
hybrid approach to extracting scientific domain knowledge entities”, In
ACM/IEEE Joint Conference on Digital Libraries, pp. 1-4, 2017
[9] D. B. Bracewell, F. Ren, S. Kuriowa, “Multilingual single document
keyword extraction for information retrieval”, IEEE International
Conference on Natural Language Processing and Knowledge
Engineering, pp. 517-522, 2005
[10] D. Kuttiyapillai, R. Rajeswari, “Extended text feature classification with
information extraction”, International Journal of Applied Engineering
Research, Vol. 10, No. 29, pp. 22671-22676, 2015
[11] S. C. Watkins, The young and the digital: What the migration to social-
network sites, games, and anytime, anywhere media means for our
future, Beacon Press, 2009
[12] I. M. Soboroff, D. P. McCullough, J. Lin, C. Macdonald, I. Ounis, R.
McCreadie, “Evaluating real-time search over tweets”, International
Conference on Weblogs and Social Media, pp. 943-961, 2012
[13] H. L. Yang, A. F. Chao, “Sentiment analysis for Chinese reviews of
movies in multi-genre based on morpheme-based features and
collocations”, Information Systems Frontiers, Vol. 17, No. 6, pp. 1335-
1352, 2015
[14] J. Yang, J. Leskovec, “Patterns of temporal variation in online media”,
4th ACM International Conference on Web Search and Data Mining, pp.
177-186, 2011
[15] D. M. Blei, A. Y. Ng, M. I. Jordan, “Latent dirichlet allocation”, Journal
of Machine Learning Research, Vol. 3, pp. 993-1022, 2003
[16] D. M. Blei, J. D. Lafferty, “Dynamic topic models”, 23rd international
conference on Machine learning, pp. 113-120, 2006
[17] M. Habibi, A. Popescu-Belis, “Keyword extraction and clustering for
document recommendation in conversations”, IEEE/ACM Transactions
on Audio, Speech, and Language Processing, Vol. 23, No. 4, pp. 746-
759, 2015
[18] S. Beliga, Keyword extraction: A review of methods and approaches,
University of Rijeka, Department of Informatics, 2014
[19] S. Usui, P. Palmes, K. Nagata, T. Taniguchi, N. Ueda, “Keyword
extraction, ranking, and organization for the neuroinformatics platform”,
Biosystems, Vol. 88, No. 3, pp. 334-342, 2007
[20] H. Zhao, Q. Zeng, “Micro-blog keyword extraction method based on
graph model and semantic space”, Journal of Multimedia, Vol. 8, No. 5,
pp. 611-617, 2013
[21] H. Hromic, N. Prangnawarat, I. Hulpus, M. Karnstedt, C. Hayes,
“Graph-based methods for clustering topics of interest in Twitter”,
International Conference on Web Engineering, pp. 701-704, Springer,
[22] L. Marujo, W. Ling, I. Trancoso, C. Dyer, A. W. Black, A. Gershman,
D. M. de Matos, J. P. Neto, J. G. Carbonell, “Automatic keyword
extraction on Twitter”, ACL (2), pp. 637-643, 2015
[23] D. Kim, D. Kim, S. Rho, E. Hwang, “Detecting trend and bursty
keywords using characteristics of Twitter stream data”, International
Journal of Smart Home, Vol. 7, No. 1, pp. 209-220, 2013
[24] P. Torres-Tramon, H. Hromic, B. R. Heravi, “Topic detection in Twitter
using topology data analysis”, International Conference on Web
Engineering, pp. 186-197, 2015
[25] S. Beliga, A. Mestrovic, S. Martincic-Ipsic, “An overview of graph-
based keyword extraction methods and approaches”, Journal of
Information and Organizational Sciences, Vol. 39, No. 1, pp. 1-20, 2015
[26] W. D. Abilhoa, L. N. De Castro, “A keyword extraction method from
Twitter messages represented as graphs”, Applied Mathematics and
Computation, Vol. 240, pp. 308-325, 2014
[27] A. Benny, M. Philip, “Keyword based tweet extraction and detection of
related topics”, Procedia Computer Science, Vol. 46, pp. 364-371, 2015
[28] W. Chung, H. Chen, J. F. Nunamaker Jr, “A visual framework for
knowledge discovery on the web: An empirical study of business
intelligence exploration”, Journal of Management Information Systems,
Vol. 21, No. 4, pp. 57-84, 2005
[29] D. Isa, L. H. Lee, V. P. Kallimani, R. Rajkumar, “Text document
preprocessing with the bayes formula for classification using the support
vector machine”, IEEE Transactions on Knowledge and Data
engineering, Vol. 20, No. 9, pp. 1264-1272, 2008
[30] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein,
M. Heilman, D. Yogatama, J. Flanigan, N. A. Smith, “Part-of-speech
tagging for Twitter: Annotation, features, and experiments”, 49th
Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies: short papers, Vol. 2, pp. 42-47, 2011
[31] P. Carpena, P. A. Bernaola-Galvan, C. Carretero-Campos, A. V.
Coronado, “Probability distribution of intersymbol distances in random
symbolic sequences: Applications to improving detection of keywords in
texts and of amino acid clustering in proteins”, Physical Review E, Vol.
94, No. 5, pp. 052302, 2016
[32] Z. Yang, K. Gao, K. Fan, Y. Lai, “Sensational headline identification by
normalized cross entropy-based metric”, The Computer Journal, Vol. 58,
No. 4, pp. 644-655, 2014
[33] C. Li, A. Sun, J. Weng, Q. He, “Exploiting hybrid contexts for tweet
segmentation”, 36th International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 523–532, 2013
[34] J. M. J. Ventura, Automatic extraction of concepts from texts and
applications, Diss. Universidade Nova de Lisboa, 2014
[35] B. Hong, D. Zhen, “An extended keyword extraction method”, Physics
Procedia, Vol. 24B, pp. 1120-1127, 2012
[36] C. W. Wong, R. W. Luk, E. K. Ho, “Discovering ‘title-like’ terms”,
Information Processing & Management, Vol. 41, No. 4, pp. 789–800,
[37] D. Kuttiyapillai, R. Rajeswari, “A method for extracting task-oriented
information from biological text sources”, International Journal of Data
Mining and Bioinformatics, Vol. 12, No. 4, pp. 387-399, 2015
Dr. Zaffar Ahmed Shaikh received his PhD in Computer Science from the
Institute of Business Administration, Karachi (IBA-Karachi) in 2017. He is
currently working as an Assistant Professor at Benazir Bhutto Shaheed
University, Lyari, Karachi, Pakistan. He has twenty-three research
publications to his credit and has received several research grants from EPFL
(Switzerland), Higher Education Commission (Pakistan), Ministry of Higher
Education (KSA) and IBA-Karachi. His research interests include Data
Sciences, Knowledge Management, Language & Technology, Learning
Environments, MOOCs, Social Software, Technology Enhanced Learning, etc.
Dr. Shaikh is a professional member of ACM and IEEE.
... This can be achieved by using natural language processing-based keyword extraction techniques. The methodologies comprise the word-frequency-based technique TF-IDF, linguistically motivated semantic keyword extraction, word collocation techniques, machine learning classifiers, and clustering approaches (Prentice et al., 2012; Shaikh, 2018). Srinivasa and Thilagam used an iterative process of keyword generation for textual data pertinent to crime from newspaper headlines by analyzing semantic similarity using the WordNet (Miller et al., 1990) word dictionary, which provides synonyms, hyponyms and meronyms (Srinivasa & Thilagam, 2019). ...
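The TF-IDF weighting mentioned in the snippet above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any of the cited works; the toy corpus and the function name `tfidf_keywords` are invented for the example.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Score terms in each document by TF-IDF and return the top-k keywords."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF-IDF: term frequency times log inverse document frequency
        scores = {t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf}
        ranked = sorted(scores.items(), key=lambda x: -x[1])
        results.append([t for t, _ in ranked[:top_k]])
    return results

docs = [
    "police report crime in the city",
    "new city park opens in the city",
    "crime rates fall as police patrol the park",
]
print(tfidf_keywords(docs))
```

Terms that occur in every document (such as "the") receive a zero IDF weight and are pushed to the bottom of the ranking, which is the behavior that makes TF-IDF a common baseline for keyword extraction.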
... The techniques, being unsupervised, remove the overhead of annotating data for training purposes. The keyword extraction process can be enhanced by topic modeling, which reduces the volume of textual data to a smaller and more relevant set (Shaikh, 2018). Srinivasa and Thilagam (2019) used LDA to extract data pertinent to crime from newspaper headlines and extracted keywords from the resultant set. ...
In this contemporary era, where a large part of the world population is deluged by extensive use of the internet and social media, terrorists have found a potential opportunity to execute their vicious plans. They have got a befitting medium to reach out to their targets to spread propaganda, disseminate training content, operate virtually, and further their goals. To restrain such activities, information over the internet in the context of terrorism needs to be analyzed and channeled into appropriate measures for combating terrorism. Open Source Intelligence (OSINT), an emerging discipline that leverages publicly accessible sources of information over the internet to extract intelligence, offers a felicitous solution to this problem. The process of OSINT extraction is broadly observed to be in three phases: (i) Data Acquisition, (ii) Data Enrichment, and (iii) Knowledge Inference. In the context of terrorism, researchers have made noticeable contributions across these three phases. However, a comprehensive review that delineates these research contributions into an integrated workflow of intelligence extraction has not been found. The paper presents the most current review in OSINT, reflecting how various state-of-the-art tools and techniques can be applied in extracting terrorism-related textual information from publicly accessible sources. Various data mining and text analysis-based techniques, that is, natural language processing, machine learning, and deep learning, have been reviewed to extract and evaluate textual data. Additionally, towards the end of the paper, we discuss challenges and gaps observed in different phases of OSINT extraction. This article is categorized under: Application Areas > Government and Public Sector; Commercial, Legal, and Ethical Issues > Social Considerations; Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining.
... Topic extraction (TE), also known as cluster labeling, category labeling, keyword extraction, topic finding, and topic discovery is beneficial in a wide range of real world applications (Shaikh, 2018). For example, when current chemical engineering publications are analyzed, topics that are becoming highly prevalent can be found; their patterns and popularity can be anticipated (Zhang et al., 2016). ...
The continually developing Internet generates a considerable amount of text data. When attempting to extract general topics or themes from a massive corpus of documents, dealing with such a large volume of text data in an unstructured format is a big problem. Text document clustering (TDC) is a technique for grouping texts based on their content similarity. Partitioning a text collection based on the documents' content significance is one of the most challenging tasks in TDC. This study proposes the Bare-Bones Based Salp Swarm Algorithm (BBSSA) to solve the TDC problem. In addition, an ensemble approach for automatic topic extraction (TE) from the clusters is proposed. The proposed BBSSA and the ensemble TE approach are tested using six standard benchmarks and six scientific publishing datasets from top QS-ranked UAE universities. BBSSA's findings are compared with sixteen well-known techniques, including eleven metaheuristic algorithms, such as the Whale Optimization Algorithm (WOA), Firefly Algorithm (FFA), Bat Algorithm (BAT), Harmony Search (HS), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Multi-Verse Optimizer (MVO), Grey Wolf Optimizer (GWO), Moth-Flame Optimization (MFO), Krill Herd Algorithm (KHA), and SSA, and five clustering methods, such as K-means++, K-means, Density-based Spatial Clustering of Applications with Noise (DBSCAN), Spectral, and Agglomerative. The results of the ensemble TE approach are compared with those of seven well-known statistical methods, including Mutual Information (MI), TextRank (TR), Co-Occurrence Statistical Information-based Keyword Extraction (CSI), Term Frequency-Inverse Document Frequency (TF-IDF), most-frequent-based keyword extraction (TF), YAKE!, and RAKE. According to the experiments, the BBSSA outperforms all other approaches and is exceedingly competitive.
The results also reveal that for most datasets, the proposed ensemble TE strategy outperforms all existing TE methods based on external metrics. Thus, the ensemble TE approach can be seen as a supplement to the other methods.
... Two European countries (Sweden and Switzerland) are examined in detail in the study. The main macroeconomic indicators of these countries are similar [30]-[36]. ...
The article determines factors affecting the demand for Central Bank Digital Currency (CBDC) introduction. The research paper identifies these key parameters to be similar to those contributing to cash demand. Moreover, a comparative analysis between Sweden and Switzerland is presented based on these factors to highlight differences in national economies, which play a significant role in the subsequent demand for CBDC. The timeframe of this analysis is mainly based on the period of the recent COVID-19 pandemic, which brought with itself a global economic crisis. These difficult macroeconomic conditions emphasize the characteristics of national economies that may be evidence of the need of CBDC issuance.
... We provide the model with training knowledge, and then the machine predicts whether it is a circle or a square. This supervised learning method is used in the classification and regression subcategories [12]-[18]. Unsupervised learning works on unstructured or hidden data, where we provide the machine with different data. ...
The rapid advancement in Information Technology (IT) and the development of Artificial Intelligence (AI) make systems more efficient and effective in performing day-to-day tasks, such as identification, extraction, detection, and recognition-related problems. These point towards the concept of Machine Learning (ML) and the proposed efficient techniques for distinct purposes, which are performed as supervised, semi-supervised, or unsupervised ML. ML systems have the potential to self-learn and adapt themselves, without being explicitly programmed, based on earlier experience. ML and data mining approaches are justified in this research, featuring the Principal Component Analysis (PCA) method in supervised and unsupervised learning. Both machine learning approaches use a variety of methodologies. For this purpose, we use the same datasets for classification, regression, and clustering procedures, as well as the PCA data mining technique, to evaluate the K-Means, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM) algorithms. Finally, we present simulations that show the estimated criteria, parameters, and efficiency of the algorithms' performance, with state-of-the-art results.
... Our system is a client-server application with a ReactJS [23] front end and a Flask [24] back end. The fraud detection module was developed using the open-source deep learning framework PyTorch [25]-[31]. The detection module is deployed on the server, where the input data is preprocessed and sent to the front end for visualization. ...
Mobile money transfer systems (MMTS) in countries with limited banking are increasingly becoming the mainstream banking system. The analysis of transactions performed on these systems helps to detect fraudulent and criminal activities. This paper introduces a novel visual analytics framework for visual analysis and exploration of mobile money transactions. Our system enables the empirical analysis of mobile money transactions data using multiple views to reveal the temporal, geospatial, and categorical aspects of the transactions. Several challenges were identified related to the given MMT datasets through the process of implementing the fraud detection framework. In addition, as a step towards recognizing the difficult task of developing versatile and flexible fraud detection models, this work proposes concepts to address the identified challenges.
... Precision amounts to the number of images correctly labeled as belonging to the positive class divided by the total number of images labeled as belonging to the positive class by the model [43, 44], as follows: ...
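In standard notation (TP = true positives, FP = false positives, FN = false negatives), the definition quoted above, together with the companion recall metric, can be written as shown below; these symbols are the conventional ones and not necessarily the cited paper's notation:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```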
Skin cancer is one of the most prevalent and deadly types of cancer. Dermatologists diagnose this disease primarily visually. Multiclass skin cancer classification is challenging due to the fine-grained variability in the appearance of its various diagnostic categories. On the other hand, recent studies have demonstrated that convolutional neural networks outperform dermatologists in multiclass skin cancer classification. We developed a preprocessing image pipeline for this work. We removed hairs from the images, augmented the dataset, and resized the images to meet the requirements of each model. By performing transfer learning on pre-trained ImageNet weights and fine-tuning the Convolutional Neural Networks, we trained the EfficientNets B0-B7 on the HAM10000 dataset. We evaluated the performance of all EfficientNet variants on this imbalanced multiclass classification task using metrics such as Precision, Recall, Accuracy, F1 Score, and Confusion Matrices to determine the effect of transfer learning with fine-tuning. This article presents the classification scores for each class as Confusion Matrices for all eight models. Our best model, the EfficientNet B4, achieved an F1 Score of 87 percent and a Top-1 Accuracy of 87.91 percent. We evaluated EfficientNet classifiers using metrics that take the high class imbalance into account. Our findings indicate that increased model complexity does not always imply improved classification performance. The best performance arose with intermediate-complexity models, such as EfficientNet B4 and B5. The high classification scores resulted from many factors, such as resolution scaling, data enhancement, noise removal, successful transfer learning of ImageNet weights, and fine-tuning [71, 72, 73]. Another finding, evident from the Confusion Matrices, was that certain classes of skin cancer generalized better than others.
... However, there is uncertainty regarding defining these topics, and there is an ongoing debate about how to extract them automatically. Besides, extracting these topics using manual methods is slow, expensive, and bristling with mistakes [88]. One of the most commonly used techniques to identify topics is to cluster documents to identify specific sets of scholarly papers, representing a related subject matter [96]. ...
The automatic topic extraction (TE) from scientific publications provides a very compact summary of the clusters’ contents. This often helps in locating information easily. TE enables us to define the boundaries of the scientific fields. Text Document Clustering (TDC) represents, in general, the first step of topic identification to identify the documents, which address a related subject matter. Metaheuristics are typically used as efficient approaches for TDC. The multi-verse optimizer algorithm (MVO) involves a stochastic population-based algorithm. It has been recently proposed and successfully utilized to tackle many hard optimization problems. In the TE process, the focus of each statistical TE method is placed on various language feature space aspects. The aim of this paper is to design a novel ensemble method for an automatic TE from a collection of scientific publications based on MVO as the clustering algorithm. The automatic TE, which is used in our approach, is term frequency-inverse document frequency (TF-IDF), most frequent based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). A group of candidate topics can be provided by each automatic TE method for the proposed ensemble method. Next, the ensemble approach prunes the candidate topics’ set via the application of a specific filtering heuristic. Then, their scores are recalculated based on the prescribed metrics. After that, for selecting a set of topics for certain scientific publications, dynamic threshold functions are applied. The findings emphasized the refined candidate set’s efficiency, as well as effectiveness. The results also showed that the system’s quality has been improved by new topics. The proposed method achieved better precision, as well as recall on a similar dataset compared to the state-of-the-art TE methods.
... The system extracts interesting terms for QE from the top-three papers keeping in view both TF-IDF and citation score. Interesting terms are the ones having the highest TF-IDF, which is the most widely used technique for keyword extraction [32]. The system runs the revised query in the background for generating the final results list. ...
Published scholarly articles have increased exponentially in recent years. This growth has brought challenges for academic researchers in locating the most relevant papers in their fields of interest. The reasons for this vary. There is the fundamental problem of synonymy and polysemy, the query terms might be too short, thus making it difficult to distinguish between papers. Also, a new researcher has limited knowledge and often is not sure about what she is looking for until the results are displayed. These issues obstruct scholarly retrieval systems in locating highly relevant publications for a given search query. Researchers seek to tackle these issues. However, the user's intent cannot be addressed entirely by introducing a direct information retrieval technique. In this paper, a novel approach is proposed, which combines query expansion and citation analysis for supporting the scholarly search. It is a two-stage academic search process. Upon receiving the initial search query, in the first stage, the retrieval system provides a ranked list of results. In the second stage, the highest-scoring Term Frequency-Inverse Document Frequency (TF-IDF) terms are obtained from a few top-ranked papers for query expansion behind the scene. In both stages, citation analysis is used in further refining the quality of the academic search. The originality of the approach lies in the combined exploitation of both query expansion by pseudo relevance feedback and citation networks analysis that may bring the most relevant papers to the top of the search results list. The approach is evaluated on the ACL dataset. The experimental results reveal that the technique is effective and robust for locating relevant papers regarding normalized Discounted Cumulative Gain (nDCG), precision, and recall.
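The two-stage pseudo-relevance feedback described above (take a few top-ranked documents, extract their highest-scoring TF-IDF terms, and append them to the query) can be sketched as follows. This is an illustrative reconstruction with a toy corpus, not the paper's implementation; the citation-analysis component is omitted and all names are invented.

```python
import math
from collections import Counter

def expand_query(query, ranked_docs, top_docs=2, new_terms=2):
    """Pseudo-relevance feedback: add the highest-TF-IDF terms
    from the top-ranked documents to the original query."""
    n = len(ranked_docs)
    tokenized = [d.lower().split() for d in ranked_docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = Counter()
    for toks in tokenized[:top_docs]:  # only the top-ranked documents
        tf = Counter(toks)
        for t in tf:
            scores[t] += (tf[t] / len(toks)) * math.log(n / df[t])
    q_terms = set(query.lower().split())
    # Keep the best terms that are not already in the query
    extra = [t for t, _ in scores.most_common() if t not in q_terms][:new_terms]
    return query + " " + " ".join(extra)

docs = [
    "neural topic models for keyword extraction",
    "topic models and keyword ranking with citations",
    "cooking recipes for pasta",
]
print(expand_query("topic models", docs))
```

The revised query is then rerun behind the scenes, as the abstract describes, so the user only sees the final, refined result list.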
... However, there is a lot of uncertainty regarding how to define these topics, and there is an ongoing debate about how to automatically extract them. In addition, extracting these topics using manual methods is slow, expensive, and bristling with mistakes (Shaikh, 2018). One of the most commonly used techniques to identify topics is to cluster documents with the aim of identifying specific sets of scholarly papers, which represent a related subject matter. ...
For text document clustering (TDC), a novel hybrid of the multi-verse optimizer (MVO) algorithm and k-means (also called H-MVO) is proposed in this work. Moreover, a new ensemble method for automatic topic extraction (TE) from a set of scientific publications in the form of text documents is proposed, with the purpose of extracting topics from clustered documents. Often, the existing TE methods draw upon statistical theory. However, the results might differ when the same clustered document is utilized. Consequently, there can be imprecise results related to the topics extracted from the clustered documents, owing to the behavior of the TE methods. As a result, the vigorous characteristics of the TE methods are ensembled, thereby improving the accuracy of the extracted topics. The results yielded by H-MVO for TDC were compared against 14 well-regarded methods, involving five clustering methods, seven metaheuristic algorithms, and two hybrid optimization algorithms. Also, the results generated by the introduced ensembled TE method were compared against those produced by five established statistical methods in the literature. The findings revealed that the suggested ensembled TE method outperformed all the comparative methods on all the external measurements for almost all the datasets. Moreover, the new method can complement the advantages of the five previously proposed methods. Accordingly, more advanced results were obtained.
Many researchers have examined the risks imposed by Internet of Things (IoT) devices on big companies and smart towns. Due to the high adoption of IoT, its character, inherent mobility, and standardization limitations, smart mechanisms capable of automatically detecting suspicious movement on IoT devices connected to local networks are needed. With the increase of IoT devices connected through the internet, the volume of web traffic has increased. Due to this change, attack detection through common methods and old data processing techniques is now obsolete. Detecting attacks in IoT and identifying malicious traffic in the early stages is a very challenging problem due to the increase in the size of network traffic. In this paper, a framework is recommended for the detection of malicious network traffic. The framework uses three popular classification-based malicious network traffic detection methods, namely Support Vector Machine (SVM), Gradient Boosted Decision Trees (GBDT), and Random Forest (RF), with the RF supervised machine learning algorithm achieving far better accuracy (85.34%). The NSL-KDD dataset was used in the recommended framework, and the performances in terms of training and prediction time, specificity, and accuracy were compared.
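The comparison metrics mentioned above (accuracy, specificity) are derived from a binary confusion matrix. A minimal sketch, with illustrative counts that are not taken from the paper's NSL-KDD experiments:

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy and specificity from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    specificity = tn / (tn + fp)  # true-negative rate
    return accuracy, specificity

# Illustrative counts only (hypothetical, not from the cited paper)
acc, spec = binary_metrics(tp=420, fp=30, tn=433, fn=117)
print(round(acc, 4), round(spec, 4))
```

Specificity complements accuracy here because intrusion datasets are typically imbalanced: a classifier can score high accuracy while missing most benign (negative) traffic, which specificity exposes.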
Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.
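The core idea above, measuring how far a symbol's spatial distribution deviates from the random expectation, can be illustrated with the classical sigma-over-mu clustering measure on intersymbol distances. This is a simplified stand-in for the exact finite-size measure derived in the paper; the toy sequence and function names are invented for the example.

```python
from statistics import mean, pstdev

def intersymbol_distances(tokens, symbol):
    """Distances between successive occurrences of `symbol` in a sequence."""
    positions = [i for i, t in enumerate(tokens) if t == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]

def clustering_score(tokens, symbol):
    """sigma/mu of the distance distribution; values well above the
    random expectation suggest clustering, i.e. an uneven spatial
    distribution, which is the signature of relevant symbols."""
    d = intersymbol_distances(tokens, symbol)
    return pstdev(d) / mean(d)

seq = list("abcaaabbcabcabcaaa")  # toy symbolic sequence
print(clustering_score(seq, "a"))
```

For a geometrically distributed (random) spacing, sigma/mu tends to 1 for rare symbols; the paper's contribution is an exact correction for short sequences and low symbol frequency, where this asymptotic baseline is unreliable.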
In this paper, we build a corpus of tweets from Twitter annotated with keywords using crowdsourcing methods. We identify key differences between this domain and other domains on which such work has been performed, such as news, which make existing approaches for automatic keyword extraction generalize poorly on Twitter datasets. These differences include the small amount of content in each tweet, the frequent usage of lexical variants, and the high variance in the number of keywords present in each tweet. We propose methods for addressing these issues, which lead to solid improvements on this dataset for this task.
The paper surveys methods and approaches for the task of keyword extraction. A systematic review of methods was conducted, resulting in a comprehensive overview of existing approaches. Work related to keyword extraction is elaborated for supervised and unsupervised methods, with a special emphasis on graph-based methods. Various graph-based methods are analyzed and compared. The paper provides guidelines for future research plans and encourages the development of new graph-based approaches for keyword extraction.
Neuroscientists are starting to share and integrate data, but shifting to a team approach isn't easy.
The massive volume of content generated by social media greatly exceeds human capacity to manually process this data in order to identify topics of interest. As a solution, various automated topic detection approaches have been proposed, most of which are based on document clustering and burst detection. These approaches normally represent textual features in standard n-dimensional Euclidean metric spaces. However, in these cases, directly filtering noisy documents is challenging for topic detection. Instead we propose Topol, a topic detection method based on Topology Data Analysis (TDA) that transforms the Euclidean feature space into a topological space where the shapes of noisy irrelevant documents are much easier to distinguish from topically-relevant documents. This topological space is organised in a network according to the connectivity of the points, i.e. the documents, and by only filtering based on the size of the connected components we obtain competitive results compared to other state of the art topic detection methods.
Online Social Media provides real-time information about events and news in the physical world. A challenging problem is then to identify, in a timely manner, the few relevant bits of information in these massive and fast-paced streams. Most of the current topic clustering and event detection methods focus on user-generated content, hence they are sensitive to language and writing style, and are usually expensive to compute. Instead, our approach focuses on mining the structure of the graph generated by the interactions between users. Our hypothesis is that bursts in user interest for particular topics and events are reflected by corresponding changes in the structure of the discussion dynamics. We show that our method is capable of effectively identifying event topics in Twitter ground truth data, while offering better overall performance than a purely content-based method based on LDA topic models.
There has been much domain-specific keyword extraction research, but micro-blog-oriented keyword extraction is just beginning. This paper investigates keyword extraction from Chinese micro-blogs. Taking the characteristics of micro-blogs into account, such as shortness and topic divergence, we propose a Chinese micro-blog keyword extraction method based on the combination of multiple features. First, we create a graph model based on the co-occurrence between words and obtain one kind of weight from it. Because this graph-based weight is sometimes the same for different words, we then create a semantic space based on a topic detection method and obtain a statistical weight from it. Finally, we take the location of words into account during extraction, which proves to be a very effective feature. Experimental results show that the proposed keyword extraction method is very successful.
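A graph model built from word co-occurrence, of the kind the abstract describes, can be sketched with a sliding window; here each node's degree serves as a stand-in keyword weight. This is a toy illustration only: the cited paper combines its graph weight with a semantic-space weight and word location, both of which are omitted here.

```python
from collections import defaultdict

def cooccurrence_degree(tokens, window=2):
    """Build an undirected word co-occurrence graph (edges between words
    appearing within `window` positions) and use each word's degree
    as a simple keyword weight."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1: i + 1 + window]:
            if v != w:  # skip self-loops
                neighbors[w].add(v)
                neighbors[v].add(w)
    return {w: len(vs) for w, vs in neighbors.items()}

tokens = "topic detection tracks topic trends in short micro blog posts".split()
weights = cooccurrence_degree(tokens)
print(sorted(weights, key=weights.get, reverse=True)[:3])
```

Repeated words such as "topic" accumulate neighbors across all their occurrences, so words central to the text acquire high degree, which is the intuition behind graph-based keyword weighting.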
Classification is used to model different datasets in many existing works. However, very few studies related to terrorist tracking in sensitive areas are found in the literature. To support effective detection and identification of terrorists and their activities, modeling GPS and RFID data is necessary, but poor representation of features leads to low classification accuracy. In this study, a classification method is studied for analyzing the behavior of terrorists in sensitive areas, based on location, order of visit, usage patterns of devices, and other activities. This approach deals with finding relevant sequences of information that are considered good features, satisfying the order-preserving property. Furthermore, to gain better computational efficiency, the length of the task-sequence is confined to a maximum length. Unlike closed sequences, partial or length-based sequences can significantly improve computational efficiency with no loss of accuracy. The proposed framework for relevant, partial sequence-based classification gives better accuracy than other approaches on a real dataset.
A method for information extraction which processes unstructured data from a document collection has been introduced. A dynamic programming technique, adopted to find the longest and most accurate relevant gene sequences, is used for finding matching sequences and identifying the effects of various factors. The proposed method can handle complex information sequences that take on different meanings in different situations, eliminating irrelevant information. The text contents were pre-processed using a general-purpose method and passed through an entity tagging component. The bottom-up scanning of key-value pairs improves content finding and generates sequences relevant to the testing task. This paper highlights a context-based extraction method for extracting food safety information, which is identified from articles, guideline documents and laboratory results. The graphical disease model verifies weak components through utilisation of a development data set. This improves the accuracy of information retrieval in biological text analysis and reporting applications.