International Journal of Computer Applications (0975-8887)
Volume 181 - No. 1, July 2018
Text Mining: Use of TF-IDF to Examine the Relevance of
Words to Documents
Shahzad Qaiser
School of Computing
Universiti Utara Malaysia
Sintok, Kedah, Malaysia
Ramsha Ali
School of Quantitative Sciences
Universiti Utara Malaysia
Sintok, Kedah, Malaysia
ABSTRACT
In this paper, the use of TF-IDF (term frequency-inverse document frequency) for examining the relevance of keywords to documents in a corpus is discussed. The study focuses on how the algorithm can be applied to a number of documents. First, the working principle and the steps to follow when implementing TF-IDF are elaborated. Secondly, in order to verify the findings from executing the algorithm, results are presented, and the strengths and weaknesses of the TF-IDF algorithm are compared. The paper also discusses how these weaknesses can be tackled. Finally, the work is summarized and future research directions are discussed.
General Terms
Text Mining, Text Analytics
Keywords
TF-IDF, Data Mining, Relevance of Words to Documents
1. INTRODUCTION
The processing of structured or semi-structured data in organizations is becoming very difficult as the volume of data has increased tremendously [1], [2]. Many techniques and algorithms can be used to process data, but this study focuses on one of them, known as TF-IDF. TF-IDF is a numerical statistic that shows the relevance of keywords to specific documents; in other words, it provides the keywords by which specific documents can be identified or categorized. For example, consider a blogger running a blog with hundreds of contributors who has just hired an intern whose main task is to add new blog posts on a daily basis. It is often observed that interns do not take care of tags, so many blog posts remain uncategorized. This is an ideal situation for applying the TF-IDF algorithm, which can identify the tags automatically for the blogger. It saves plenty of time for both bloggers and interns, as they no longer need to assign tags manually [3].
The paper is organized as follows: In Section 2, the background of the TF-IDF algorithm is discussed. Section 3 describes the procedure and research method. Section 4 explains the implementation details of TF-IDF along with its results. Section 5 discusses the limitations of the algorithm and related work, and Section 6 elaborates how those limitations can be resolved through newer techniques. Finally, Section 7 concludes the work and discusses future prospects.
2. BACKGROUND
2.1 Term Frequency (TF)
TF-IDF is a combination of two different measures: Term Frequency and Inverse Document Frequency. First, "term frequency" will be discussed. TF measures how many times a term is present in a document [4]. Suppose we have a document "T1" containing 5000 words, and the word "Alpha" is present in the document exactly 10 times. It is well known that the total length of documents varies from very small to very large, so any term may occur more frequently in large documents than in small ones. To correct for this, the number of occurrences of a term in a document is divided by the total number of terms in that document to obtain the term frequency. In this case, the term frequency of the word "Alpha" in the document "T1" is
TF = 10/5000 = 0.002
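As a minimal sketch (the paper's own implementation is in PHP; Python is used here purely for illustration), the TF calculation above is:

```python
def term_frequency(term_count: int, total_terms: int) -> float:
    """TF = occurrences of the term / total terms in the document."""
    return term_count / total_terms

# The word "Alpha" occurs 10 times in the 5000-word document "T1":
tf_alpha = term_frequency(10, 5000)
print(tf_alpha)  # 0.002
```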
2.2 Inverse Document Frequency (IDF)
Now, inverse document frequency will be discussed. When only the term frequency of a document is calculated, the algorithm treats all keywords equally, even stop words such as "of", which is incorrect. Keywords differ in importance: the stop word "of" may be present in a document 2000 times, yet carry little or no significance. That is exactly what IDF corrects for. Inverse document frequency assigns a lower weight to frequent words and a greater weight to words that are infrequent. For example, if we have 10 documents and the term "technology" is present in 5 of those documents, the inverse document frequency is calculated as [4]
IDF = log10(10/5) = 0.3010
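The IDF step can be sketched the same way, using the base-10 logarithm that matches the numeric value above (again illustrative Python, not the paper's PHP code):

```python
import math

def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """IDF = log10(total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)

# "technology" appears in 5 of the 10 documents:
idf_tech = inverse_document_frequency(10, 5)
print(f"{idf_tech:.4f}")  # 0.3010
```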
2.3 Term Frequency - Inverse Document
Frequency (TF-IDF)
From Sections 2.1 and 2.2 it is understood that a higher occurrence of a word within a document gives a higher term frequency, while a lower occurrence of a word across documents yields a higher importance (IDF) for that keyword. TF-IDF is nothing but the multiplication of term frequency (TF) and inverse document frequency (IDF). Using the TF and IDF values already calculated in Sections 2.1 and 2.2, respectively, TF-IDF is computed as [4]
TF-IDF = 0.002 * 0.3010 = 0.000602
3. PROCEDURE & RESEARCH METHOD
3.1 Data Collection
The data was collected from 20 random websites belonging to 4 different domains. Only ordinary website content, freely available to the general public over the internet, was used in this study.
Table 1. Domains & Websites

No.   Domain   Website Count
1     .biz     5
2     .com     5
3     .edu     5
4     .org     5
Total: 4 domains, 20 websites
3.2 Data Preprocessing
Because the data was collected from websites, it contained HTML/CSS markup, which was not useful and was completely removed first. Secondly, the data contained many stop words, which are never meaningful in this context, as explained in Section 2.2. To filter them out, a list of 500 stop words was applied, removing all stop words from the data [1], [4], [5]. Large stop-word lists can easily be obtained from many blogs and websites, where they are freely available for the general public to use.
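The preprocessing step can be sketched as follows (illustrative Python; the tag-stripping regex and the short stop-word list are simplifications of the 500-word list the study used):

```python
import re

# A short illustrative stop-word list; the study used a list of 500 words.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}

def preprocess(raw_html: str) -> list[str]:
    """Strip HTML/CSS markup, lowercase the text, and drop stop words."""
    # Remove inline <style> blocks (CSS), then any remaining HTML tags.
    text = re.sub(r"<style.*?</style>", " ", raw_html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("<p>The garden of <b>fresh</b> parts</p>"))
# ['garden', 'fresh', 'parts']
```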
3.3 Design
In this study, a few steps have to be followed in order to implement the TF-IDF algorithm successfully.
First of all, data has to be fetched from the websites. Secondly, in the preprocessing phase, all HTML/CSS markup and stop words must be removed, as they are unnecessary and carry no importance in this scenario. Third, the total number of words and their occurrences must be counted in all documents. Once these steps are performed, the term frequency formula can be applied to calculate TF, as discussed in Section 2.1 [1], [4], [6].
After calculating TF, one has to check whether each word is found in every document, and count the total number of documents in hand. Once these steps are completed, the inverse document frequency formula can be applied to calculate IDF, as discussed in Section 2.2 [4].
Finally, after obtaining TF and IDF, TF-IDF can easily be calculated by applying its formula as described in Section 2.3. The algorithm can be implemented in any programming language; for this paper, it was implemented in PHP for the sake of simplicity [4].
The TF-IDF process is shown in the diagram below, which depicts all the major and minor steps that have to be taken to implement the algorithm in a computer program.
Fig 1: TF-IDF Process
Fig 1 should be followed from top to bottom in order to implement TF-IDF. The process is quite simple, but one really needs to take special care with data preprocessing, as it is very important for achieving accurate results.
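The whole pipeline of Fig 1 can be sketched end to end (illustrative Python over already-preprocessed word lists; the paper's program is in PHP, and the sample documents here are hypothetical):

```python
import math
from collections import Counter

def tf_idf_scores(documents: list[list[str]]) -> list[dict[str, float]]:
    """Score every word of every (preprocessed) document, following Fig 1."""
    n_docs = len(documents)
    # Count, for each word, how many documents contain it (needed for IDF).
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts, total = Counter(doc), len(doc)
        scores.append({
            word: (count / total) * math.log10(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = [["garden", "parts", "garden"],
        ["held", "presidential"],
        ["years", "years", "ms"]]
first = tf_idf_scores(docs)[0]
print(max(first, key=first.get))  # garden
```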
4. IMPLEMENTATION & DISCUSSION
The algorithm was implemented in PHP, so it can be used via a web browser. The interface is very easy to use: the user selects a document by clicking the browse button. After the document is provided, the program executes all the steps shown in Fig 1 and produces output as shown below in Fig 2. The program lists a serial number, each word, its number of occurrences, its term frequency, its inverse document frequency and, finally, the TF-IDF score on which this study is focused.
It should be noted that the output of the algorithm can be sorted in ascending or descending order, by either occurrences or TF-IDF score, so that the keywords with the greatest occurrences or highest TF-IDF scores come out on top (descending order), or the keywords with the fewest occurrences or lowest TF-IDF scores come out on top (ascending order). This can really help in analyzing or slicing the data to generate reports or visualizations.
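Sorting the output as described takes a single call per ordering (illustrative Python; the row values here are hypothetical, merely shaped like the program's output):

```python
# Hypothetical per-keyword rows, shaped like the program's output.
rows = [
    {"word": "held", "occurrences": 1, "tf_idf": 0.0010},
    {"word": "parts", "occurrences": 3, "tf_idf": 0.0104},
    {"word": "garden", "occurrences": 5, "tf_idf": 0.0101},
]

# Descending by TF-IDF score puts the most relevant keywords on top ...
by_score = sorted(rows, key=lambda r: r["tf_idf"], reverse=True)
# ... while ascending by occurrences surfaces the rarest words first.
by_count = sorted(rows, key=lambda r: r["occurrences"])

print([r["word"] for r in by_score])  # ['parts', 'garden', 'held']
print([r["word"] for r in by_count])  # ['held', 'parts', 'garden']
```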
The program's running time ranges from a few microseconds to a few seconds or a minute, depending on the size of the provided dataset.
Fig 2: TF-IDF program interface after providing dummy documents
Table 2. Results of running the algorithm on three documents from each domain

Domain   Rank (TF-IDF)   Keyword        TF            TF-IDF
.biz     1               Parts          0.02189781    0.01044791
.biz     2               Garden         0.02120141    0.01011564
.com     1               Presidential   0.00471698    0.00225057
.com     2               Held           0.002105263   0.00100446
.edu     1               Years          0.080519480   0.03841755
.edu     2               Ms             0.038961038   0.01858913
.org     1               Marking        0.014084507   0.00672001
.org     2               Scholarships   0.017414965   0.00306662
Fig 3: Top keyword per domain by TF-IDF score, from 3 documents each (Years/.edu: 0.0384; parts/.biz: 0.0104; Marking/.org: 0.0067; presidential/.com: 0.0023)
Table 2 shows the top two keywords by TF-IDF score for three documents from each domain. The data was produced by running the PHP program on the dataset collected from all domains, as shown in Fig 2. Table 2 indicates that, for the three selected documents of the .biz domain, the most relevant keywords describing those documents are "parts" and "garden". Similarly, in the three documents of the .com domain, the top two words are "presidential" and "held", and in the .edu domain the top keywords are "years" and "Ms"; these keywords describe their documents and can be used as tags to categorize the content. The .org domain likewise yields the keywords "marking" and "scholarships" for the same purpose.
Fig 3 shows only the top keyword from each domain for three documents. The highest TF-IDF score belongs to the keyword "years" from the ".edu" domain. The second highest belongs to "parts" from the ".biz" domain, and the third position goes to "marking" from the ".org" domain. The algorithm was then tested again, this time with all documents from all domains.
Table 3. Top 12 keywords from all documents

Domain   Keyword       TF-IDF
.edu     Nts           0.07120161
.edu     Islamabad     0.06181321
.edu     Karachi       0.05132980
.edu     Bahria        0.04534312
.edu     Campus        0.04223501
.edu     Years         0.04122134
.com     Product       0.03543481
.biz     Goods         0.03212431
.edu     University    0.03012134
.com     Equipments    0.02112122
.org     Engineering   0.01233121
.org     Police        0.01121121
Fig 4: Top keywords as per their TF-IDF Score from all documents for all domains
Table 3 shows the top 12 keywords by TF-IDF score across all documents and domains. In Fig 4 it can be observed that, when content from all domains (.biz, .com, .edu and .org) was processed, the most important and relevant keywords are displayed in sorted form: the keyword with the highest TF-IDF score is at the top and the lowest is at the end. This confirms that the TF-IDF algorithm returns results ordered from the most relevant to the most
irrelevant keywords [3], which is very important if one has to analyze the data or needs to generate tags specifying the category of a document or blog post.
5. LIMITATIONS & RELATED WORK
The TF-IDF algorithm has some limitations that need to be addressed. The major constraint is that the algorithm cannot identify words even under a slight change of tense: it treats "go" and "goes" as two different, independent words, and likewise "play" and "playing", "mark" and "marking", "year" and "years". Due to this limitation, applying the TF-IDF algorithm sometimes gives unexpected results [7]. Another limitation is that TF-IDF cannot check the semantics of the text in documents, so it is useful only at the lexical level. It is also unable to account for co-occurrences of words. Many techniques can be used to improve performance and accuracy, as discussed in [8], such as decision trees, pattern- or rule-based classifiers, SVM classifiers, neural network classifiers and Bayesian classifiers. Another author [9] detected a defect in standard TF-IDF, namely that it is not effective when the text to be classified is not uniform, and proposed an improved TF-IDF algorithm to deal with that situation. Yet another author [10] combined TF-IDF with Naïve Bayes for proper classification while considering the relationships between classes.
6. SOLUTIONS
With the passage of time, new algorithms keep emerging that resolve some limitations of older ones. For example, stemming can be used to overcome TF-IDF's inability to recognize that "play" and "plays" are basically the same word [5]. The stemming process conflates the different forms of a particular word, such as "play", "plays" or "played", into a single, more generic representation such as "play". Secondly, as many stop words as possible should be added to the stop-word list, so that words of no value, such as "the" or "a", are filtered out and removed before the data is processed [5]. This ensures, to some extent, that the output consists of useful words.
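A toy suffix-stripping stemmer illustrates the conflation idea (a real system would use a Porter or Snowball stemmer rather than this simplified rule set):

```python
# Common suffixes, checked in order: "ing" before "s", so "playing" -> "play".
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word: str) -> str:
    """Strip one common suffix so word variants map to one representation."""
    for suffix in SUFFIXES:
        # Keep at least a 3-letter root so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["play", "plays", "played", "playing"]])
# ['play', 'play', 'play', 'play']
```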
7. CONCLUSION
The TF-IDF algorithm is easy to implement and very powerful, but its limitations cannot be neglected. In today's world of big data, new techniques are required for processing data before analysis is performed. Researchers have proposed an improved form of TF-IDF known as Adaptive TF-IDF, which incorporates hill-climbing to boost performance. A variant of TF-IDF has also been applied cross-language by using statistical translation. Genetic algorithms have also been put to work to improve TF-IDF, applying the natural genetic concepts of crossover and mutation programmatically, but this did not catch on, as the performance improvement was very slight. Search engine giants like Google have adopted newer algorithms such as PageRank to bring out the most relevant results when a user places a query. Future research is likely to produce new techniques that overcome the limitations of TF-IDF, so that query retrieval can be more accurate. TF-IDF can also be combined with other techniques, such as Naïve Bayes, to get even better results.
8. ACKNOWLEDGMENT
The authors wish to thank all the respected professors who helped during the experimentation, development and writing phases of this paper.
9. REFERENCES
[1] Bafna, P., Pramod, D., and Vaidya, A. (2016). "Document clustering: TF-IDF approach," International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, 2016, pp. 61-66.
[2] Trstenjak, B., Mikac, S., and Donko, D. (2014). "KNN with TF-IDF based framework for text categorization," Procedia Engineering, Vol. 69, pp. 1356-1364. Elsevier Ltd.
[3] Gautam, J., and Kumar, E. L. (2013). "An Integrated and Improved Approach to Terms Weighting in Text Classification," International Journal of Computer Science Issues, Vol. 10, Issue 1, No. 1, January 2013.
[4] Hakim, A. A., Erwin, A., Eng, K. I., Galinium, M., and Muliady, W. (2015). "Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF) approach," 6th International Conference on Information Technology and Electrical Engineering: Leveraging Research and Technology (ICITEE), 2014.
[5] Gurusamy, V., and Kannan, S. (2014). "Preprocessing Techniques for Text Mining," RTRICS, pp. 7-16.
[6] Nam, S., and Kim, K. (2017). "Monitoring Newly Adopted Technologies Using Keyword Based Analysis of Cited Patents," IEEE Access, Vol. 5, pp. 23086-23091.
[7] Ramos, J. (2003). "Using TF-IDF to Determine Word Relevance in Document Queries," Proceedings of the First Instructional Conference on Machine Learning, pp. 1-4.
[8] Santhanakumar, M., and Columbus, C. C. (2015). "Various Improved TFIDF Schemes for Term Weighing in Text Categorization: A Survey," International Journal of Applied Engineering Research, Vol. 10, No. 14, pp. 11905-11910.
[9] Dai, W. (2018). "Improvement and Implementation of Feature Weighting Algorithm TF-IDF in Text Classification," International Conference on Network, Communication, Computer Engineering (NCCE 2018), Vol. 147.
[10] Fan, H., and Qin, Y. (2018). "Research on Text Classification Based on Improved TF-IDF Algorithm," International Conference on Network, Communication, Computer Engineering (NCCE 2018), Vol. 147.