
A Comparative Study on keyword extraction algorithms

Source publication
Conference Paper
Full-text available
The number of research documents being published is growing rapidly. Searching for a research document in a domain of interest by reading through the whole paper has become a tedious task. Keywords and key-phrases give a summary of the text.

Similar publications

Article
Full-text available
Summarization of data is called "information", but obtaining that information from raw data is difficult. Even humans face issues in getting the correct summary, so in the era of computers we need a machine that can easily perform this task for us. Here, summarization of data not only includes copying salient and important aspect...

Citations

... The literature shown in Table 1 provides insights into the above. Apart from the TF-IDF technique, there are numerous techniques to extract features from text data in language processing tasks; a few are listed in Table 2. Studies like keyword extraction have been done with unsupervised learning methods using TextRank, RAKE, and PositionRank on research documents (Table 2 [1]), which also contributes to many NLP applications. The extractive summarization technique used for text summarization works better along with TF-IDF score-based keyword retrieval (Table 2 [2]). ...
... The workflow of the general-purpose document scanner is illustrated in Figure 2. Figure 3 represents the TF-IDF workflow. The TF-IDF (term frequency-inverse document frequency) [1] score is a prevalent method used for page ranking in search engines, text summarization [2], text similarity computation, and web data mining. TF-IDF is the most widely used term-weighting algorithm. ...
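As a concrete illustration of the TF-IDF weighting mentioned in this excerpt, the following minimal Python sketch uses scikit-learn's TfidfVectorizer on a few invented documents; it shows the generic technique, not the cited papers' implementations.

# TF-IDF sketch: weight terms by frequency and inverse document frequency (illustrative docs).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "keyword extraction summarizes the content of a research document",
    "tf-idf weights terms by frequency and inverse document frequency",
    "text summarization and information retrieval rely on term weighting",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)              # sparse matrix: documents x terms

# The highest-weighted terms of a document serve as candidate keywords.
terms = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda p: p[1], reverse=True)[:5])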
Article
Full-text available
Document scanning devices are used for visual character recognition, followed by text analytics in software. Often such character extraction is insecure, and any third party can manipulate the information. On the other hand, near-edge processing devices are constrained by limited resources and connectivity issues. The primary factors that lead to exploring independent hardware devices with natural language processing (NLP) capabilities are latency during cloud processing and computing costs. This paper introduces a hardware accelerator for information retrieval using a memristive TF-IDF implementation. In this system, each sentence is represented using a memristive crossbar layer, with each column containing a single word. The matching scores for the TF and IDF values were implemented using operational-amplifier-based comparator-accumulator circuits. The circuit is designed with a 180 nm CMOS process, the Knowm Multi-Stable Switch memristor model, and WOx device parameters. We compared its performance on a standard benchmark dataset. Variability and device-to-device issues were also taken into consideration in the analysis. This paper concludes with the implementation of TF-IDF score calculation for applications such as information retrieval and text summarization.
... Based on prior works, we created a news recommender system based on user log history using keyword extraction. The method we use for keyword extraction is RAKE, because RAKE is an unsupervised method that can perform keyword extraction over larger amounts of data, across several types of individual documents, than other keyword extraction algorithms [7], [8]. There are also several studies that discuss RAKE. ...
... There are also several studies that discuss RAKE. Thushara, M. G. et al. [8] compared the performance of three keyphrase extraction algorithms: TextRank, RAKE, and PositionRank. Based on the comparison results, PositionRank gives better results than TextRank and RAKE. ...
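To show how RAKE is typically applied in practice, here is a minimal sketch using the third-party rake_nltk package on an invented sentence (the sample text is an assumption for illustration; the package requires the NLTK "stopwords" and "punkt" resources to be downloaded first).

# RAKE sketch via the rake_nltk package (pip install rake-nltk; sample text is invented).
from rake_nltk import Rake

text = ("Rapid Automatic Keyword Extraction splits text at stop words and punctuation, "
        "then scores candidate phrases by word degree and frequency.")

rake = Rake()                                   # default English stop-word list
rake.extract_keywords_from_text(text)
print(rake.get_ranked_phrases_with_scores())    # [(score, phrase), ...], best first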
Article
Full-text available
There are many ways to find information; one of them is reading online news. However, searching for news online becomes more difficult because we must visit multiple platforms to find information. Sometimes the recommended news does not match the user's interests. In many prior works, news recommendations are based on trending topics, so the recommended news may not necessarily match the user's interests. To overcome this, we built a web-based news recommender system to make it easier for users to find news. We use the Rapid Automatic Keyword Extraction (RAKE) method in the recommendation process because this method can recommend news based on user preferences by utilizing user history logs. RAKE converts the title and content of the news into a vector representation using a Count vectorizer and applies the cosine similarity function to compare similarities between news items. The test results show that the average performance of our proposed system is 90.8%; this accuracy outperforms earlier systems with respect to the goals of a recommender system, i.e., diversity, novelty, and relevance.
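The similarity step described in this abstract (count vectors compared with cosine similarity) can be sketched roughly as follows; the news headlines and the user-history keywords are invented for illustration.

# Content similarity sketch: bag-of-words vectors + cosine similarity (illustrative data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

news = [
    "central bank raises interest rates to curb inflation",
    "local team wins the national football championship",
    "new smartphone released with improved battery life",
]
user_keywords = "interest rates inflation economy"      # e.g. keywords from the user's log history

vectors = CountVectorizer().fit_transform(news + [user_keywords])

# Compare the user profile (last row) against every news item and rank by similarity.
scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
print(sorted(zip(scores, news), reverse=True)[0])       # most relevant article for this user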
... The author studied and analysed the key extraction techniques TextRank, PositionRank, KEA, and Multi-purpose automated topic indexing (MAUI) in [12]. An overview of the text is provided by keywords and keyphrases. Keywords and key phrases aid comprehension of the material presented in the research paper [13]. ...
Chapter
Suresh, Sweety; Krishna, Gopika; Thushara, M. G. Advances in technology give rise to online unstructured data. As the data grow rapidly, handling the information becomes hard. There is a demand to manage these unstructured data to gather important insights. Clustering of text documents has become a leading-edge technique on the Internet. Document clustering is mainly described as the grouping of similar documents, and it plays a vital role in organizing massive amounts of information. The paper gives an overview of a study of different clustering algorithms on COVID data. The study of the semantic links between words and concepts in texts aids the classification of documents based on their meaning and conception. The clusters were visualized using the k-means clustering technique and then evaluated using t-distributed stochastic neighbor embedding (t-SNE) and principal component analysis (PCA).
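A minimal version of the clustering-and-visualization pipeline described in this chapter (TF-IDF features, k-means, then a 2-D projection) could look like the sketch below; the document snippets and the number of clusters are assumptions made only for illustration.

# Document clustering sketch: TF-IDF features -> k-means -> PCA projection (illustrative data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

docs = [
    "vaccine trial results for the new covid variant",
    "hospital capacity and covid patient statistics",
    "stock markets react to quarterly earnings reports",
    "tech companies report strong quarterly earnings",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                        # cluster id per document

# Project to two dimensions for plotting (t-SNE could be substituted here).
print(PCA(n_components=2).fit_transform(X.toarray()))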
... PageRank is a graph-based method built on random walks. It is fine for sifting through web pages and social media pages, but it cannot extract crucial information from authorized manuscripts [17,31]. PositionRank is an extension of PageRank that was developed to improve performance; it evaluates words by considering all of their positions and their frequency to determine their rank. ...
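To make the random-walk idea concrete, here is a tiny, generic PageRank example over a made-up link graph using networkx; it illustrates the basic algorithm only, not the formulations used in the cited papers.

# Generic PageRank sketch: rank pages of a small, invented link graph by random-walk score.
import networkx as nx

links = [
    ("home", "about"), ("home", "blog"), ("blog", "post1"),
    ("blog", "post2"), ("post1", "home"), ("post2", "home"), ("about", "home"),
]
graph = nx.DiGraph(links)

scores = nx.pagerank(graph, alpha=0.85)      # stationary distribution of the random surfer
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))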
Article
Full-text available
Automated keyphrase extraction is crucial for extracting and summarizing relevant information from a variety of publications in multiple domains. However, the extraction of good-quality keyphrases and the summarizing of information to a good standard have become extremely challenging in recent research because of the advancement of technology and the exponential growth of digital sources and textual information. Because of this, the use of keyphrase features in keyphrase extraction techniques has recently gained tremendous popularity. This paper proposes a new unsupervised region-based keyphrase centroid and frequency analysis technique, named the KCFA technique, for keyphrase extraction as a feature. Data/dataset collection, data pre-processing, statistical methodologies, curve plotting analysis, and curve fitting are the five main processes in the proposed technique. To begin, the technique collects multiple datasets from diverse sources, which are then input into the data pre-processing step by utilizing some text pre-processing processes. Afterward, the region-based statistical methodologies receive the pre-processed data, followed by the curve plotting examination and, lastly, the curve fitting technique. The proposed technique is then tested and evaluated using the ten (10) best-accessible benchmark datasets from various disciplines. The proposed approach is then compared to the available methods to demonstrate its efficacy, advantages, and importance. Lastly, the results of the experiment show that the proposed method works well to analyze the centroid and frequency of keyphrases from academic articles. It provides a centroid of 706.66 and a frequency of 38.95% in the first region, and 2454.21 and 7.98% in the second region, for a total frequency of 68.11%.
... Lahiri et al. (2017) used supervised and unsupervised learning methods for extracting keywords from emails to explore the topics in the mails and avoid excessive information. Thushara et al. (2019) provided a comparative study of the unsupervised methods PositionRank, TextRank, and Rapid Automatic Keyword Extraction (RAKE) for keyphrase extraction. Ying et al. (2017) proposed a graph-based method for keyphrase extraction that keeps important sentences in mind and considers word-sentence relationships. ...
Article
Full-text available
Retrieving keywords from a text has been attracting researchers for a long time, as it forms a base for many natural language applications such as information retrieval, text summarization, and document categorization. A text is a collection of words that naturally represent the theme of the text, and bringing this naturalism under certain rules is itself a challenging task. In the present paper, the authors evaluate different spatial-distribution-based keyword extraction methods available in the literature on three standard scientific texts. The authors choose the first few high-frequency words for evaluation to reduce the complexity, as all the methods are in some way based on frequency. The authors find that the methods do not provide good results, particularly for the first few retrieved words. Thus, the authors propose a new measure based on frequency, inverse document frequency, variance, and Tsallis entropy. Evaluation of the different methods is done on the basis of precision, recall, and F-measure. Results show that the proposed method provides improved results.
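The abstract does not give the exact formula, but the Tsallis entropy ingredient it mentions has the standard form S_q = (1 - sum_i p_i^q) / (q - 1). The sketch below computes only that ingredient for a made-up positional distribution of a word; it is not the authors' combined frequency/IDF/variance measure.

# Standard Tsallis entropy of a discrete distribution (illustration of one ingredient only).
import math

def tsallis_entropy(probs, q=2.0):
    if q == 1.0:                                     # q -> 1 recovers Shannon entropy
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

# Example: gaps between successive occurrences of a word, normalized to probabilities.
gaps = [4, 1, 7, 2]
probs = [g / sum(gaps) for g in gaps]
print(tsallis_entropy(probs, q=2.0))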
... For example, text mining techniques have come a long way from being used only for books, patents, and scholarly articles to nowadays covering social media and other online content, as well as brand-related online sentiment [21][22][23][24][25]. Modern social media research builds on web scraping tools like Facepager and Netvizz [26] as well as a variety of text processing techniques, software packages, and algorithms (from statistically demanding calculations based on latent Dirichlet allocation (LDA) [24,27,28] to packages for automated text mining, such as "NVivo", "QDA Miner", the "qdap" package in R, and "AntConc" [23,26,29,30]). Keyword extraction algorithms, e.g., PositionRank, TextRank, and RAKE (rapid automatic keyword extraction), thereby rely on predefined dictionaries of keywords [31,32]. However, novel approaches dealing with digital communication environments need to be methodologically robust and lean on a variety of methods offered in basic sciences, such as anthropology [33,34], ethnography [35], sociology [36,37], and philosophy [38,39]. ...
Article
Full-text available
This study set out to uncover brand positioning configurations by presenting state-of-the-art brand management literature and applying a novel, mixed-methods approach to examine the under-researched wine industry transformation towards open innovation in branding. German winery brands were analyzed using a multimethod approach leaning on a novel netnographic methodology and multiple sources. The sample included 572 wineries from all 13 German wine regions with website text data and online review text data from each winery. The study identified eight prime keywords used to describe both brand identity as well as wine brand image. It revealed word–price clusters of brand identity and image. The results offer insights into communication and pricing opportunities for wine brand identity as well as image, thereby contributing to open brand innovation.
... The paper proposes to give an overview and survey of various data clustering algorithms, namely k-means [12], [13], hierarchical [14], [15], density-based [16], and grid-based [17], and keyword extraction algorithms like RAKE [18], MAUI [19], TextRank [20], and TF-IDF [21]. The paper is organized into the following sections: Literature Survey, highlighting major contributions of algorithms and techniques in the area; Result and Analysis, highlighting the implementation of some of the classic keyword extraction [22], [23] and clustering algorithms together with a comparative study; and Conclusion. ...
Conference Paper
Natural Language Processing (NLP) has continually been a major focus of firms and establishments alike. There are several important contributions in this field of engineering science. As publications are contributed every year across different research domains, there is a need for keyword extraction and document clustering. This helps in organizing the publications of an institution. The paper aims at giving an overview of various keyword extraction and document clustering algorithms. This study contributes to the area of text- and document-based search engines. Also, the study helps to identify the domain of research documents and to categorize them based on their semantics. The paper aims at giving a comparative study of different keyword extraction and clustering algorithms by which one can model a new prototype for document-based search engines and categorize documents based on the various mainstream research domains of Computer Science.
... This research uses Rapid Automatic Keyword Extraction (RAKE) as the keyword extraction technique for its simplicity and ease of implementation [14]. Moreover, this algorithm is computationally efficient, fast, and precise [15]. RAKE does not need a corpus for keyword discovery [16]. ...
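RAKE needs no corpus because it scores candidates within a single document: the text is split into phrases at stop words and punctuation, and each word is scored by degree over frequency. The compressed sketch below (with a tiny, invented stop-word list) illustrates that scoring; it is not the exact implementation used in the cited work.

# Compressed RAKE-style scoring for a single document (illustrative stop-word list).
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "and", "a", "in", "is", "for", "on", "to", "by", "their"}

text = ("Rapid automatic keyword extraction scores candidate phrases "
        "by the degree and frequency of their member words.")

# 1. Split into candidate phrases at stop words (punctuation is dropped by the tokenizer).
tokens = re.findall(r"[a-z]+", text.lower())
phrases, current = [], []
for tok in tokens:
    if tok in STOPWORDS:
        if current:
            phrases.append(current)
        current = []
    else:
        current.append(tok)
if current:
    phrases.append(current)

# 2. Word scores: degree (total length of phrases containing the word) / frequency.
freq, degree = defaultdict(int), defaultdict(int)
for phrase in phrases:
    for word in phrase:
        freq[word] += 1
        degree[word] += len(phrase)
word_score = {w: degree[w] / freq[w] for w in freq}

# 3. A phrase's score is the sum of its word scores; higher means more likely a keyphrase.
ranked = sorted(((sum(word_score[w] for w in p), " ".join(p)) for p in phrases), reverse=True)
print(ranked)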
Article
Full-text available
Among the KM processes that function to guarantee access to knowledge is knowledge sharing. This process allows the knowledge assets and experiences possessed by the organization to be accessed by anyone in the organization. Especially by using IT, this process can be carried out more optimally by capturing existing knowledge into a system so that this valuable information can be monitored anytime and anywhere. There are times when the knowledge possessed by experts is difficult to capture and represent in the system, as in the case of tacit knowledge such as the instincts, insights, and experiences of the experts. One of the challenges in inventorying these experts is the process of creating expert profiles automatically based on a particular approach. This research creates an Expert Locator for lecturers who are considered experts in their field of research, using the publication data produced by these lecturers as an indication of their expertise. The search feature is built as an implementation of the extraction results and can be used by other parties to find experts by entering keywords in the form of the desired expertise.
... The domain of a research paper can be determined based on extracted keywords and keyphrases. It is monotonous to manually extract keywords and keyphrases [4]. Automatic keyword extraction techniques help to overcome this challenging task. ...
... The automatic extraction of keywords and keyphrases using machine learning has taken either supervised or unsupervised approaches. Supervised methods acquire knowledge from a global text set, while unsupervised methods extract keyphrases without previous training by evaluating their significance within the context of a single document. In a supervised approach, the model is trained on data to find out whether a given phrase is a keyphrase or not [2]. A supervised approach requires a large set of training data to train an algorithm to extract relevant keyphrases from a document [4]. An unsupervised approach does not require any training data; instead, the keywords and keyphrases are determined using various properties of the text in the document. ...
... KEA, upon implementation, revealed that it is not an efficient algorithm, especially if the user's need is to identify the domain of the paper or to bring out a meaningful phrase summarizing what the paper or article is about, since its output consists of only a set of single keywords. This makes it less useful than the other three algorithms, because a single word alone cannot give as much meaningful and general information about a document as an output that contains phrases. The PositionRank algorithm checks both the position and the frequency of terms in the document [4] before determining whether a term is a keyword or not, and it prioritises words that appear at the beginning of the document and have a higher frequency than other candidate phrases. In other words, the words or phrases that appear in the top sections, namely the abstract and introduction, have a higher likelihood of being selected as keyphrases, as can be seen in the results produced by the implementation of this algorithm. ...
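The position bias described above can be approximated with a personalized PageRank: words co-occurring within a window form a graph, and each word's restart weight is the sum of the reciprocals of its positions, so early, frequent words are favored. The sketch below is a rough illustration of that idea over an invented sentence, not the implementation evaluated in the paper.

# PositionRank-style sketch: co-occurrence graph + position-weighted (personalized) PageRank.
import networkx as nx

words = ("position rank weights candidate words by their positions and frequency "
         "so early frequent words receive a higher rank").split()

# Undirected co-occurrence graph over a small sliding window.
graph = nx.Graph()
window = 2
for i in range(len(words)):
    for j in range(i + 1, min(i + window + 1, len(words))):
        graph.add_edge(words[i], words[j])

# Position bias: sum of 1/position over all occurrences of each word, normalized to sum to 1.
bias = {}
for pos, w in enumerate(words, start=1):
    bias[w] = bias.get(w, 0.0) + 1.0 / pos
total = sum(bias.values())
personalization = {w: b / total for w, b in bias.items()}

scores = nx.pagerank(graph, personalization=personalization)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5])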
Conference Paper
Since there is an increasing number of research documents published every year, the number of documents available on the Internet is also increasing rapidly. This creates the need to categorize the available research articles into their respective domains to ease the search process and to find research documents under a specific domain. This classification is a tiresome and prolonged process, which can be avoided by using keywords and keyphrases. Keywords or keyphrases provide a summary of the information described in a research document. The domain of a research paper can be determined based on extracted keywords and keyphrases. It is monotonous to manually extract keywords and key phrases [4]. Automatic keyword extraction techniques help to overcome this challenging task. The classification of these research papers can be achieved more efficiently by using the keywords applicable to a particular domain. This paper aims to compare keyphrase extraction algorithms such as TextRank, PositionRank, the Keyphrase Extraction Algorithm (KEA), and Multi-purpose Automatic Topic Indexing (MAUI).
... Previous studies on keyword generation can be traced through the works of these scholars (Joshi & Motwani, 2006; Thomaidou & Vazirgiannis, 2011; Hussey et al., 2012; Liu et al., 2014; Savva et al., 2014; Scholz et al., 2019; Arora & Kumar, 2019; Zheng & Sun, 2019; Thushara et al., 2019). Among these works, Scholz et al. (2019) propose an automated approach for generating keywords for Sponsored Search Advertising based on their keyword generation algorithms. ...