POLİTEKNİK DERGİSİ
JOURNAL of POLYTECHNIC
ISSN: 1302-0900 (PRINT), ISSN: 2147-9429 (ONLINE)
URL: http://dergipark.org.tr/politeknik
Keyword extraction for search engine
optimization using latent semantic analysis
Gizli anlamsal analiz ile arama motorları için
anahtar kelime çıkarma
Author(s): Fahrettin HORASAN
ORCID: 0000-0003-4554-9083
To cite this article: Horasan F., "Keyword extraction for search engine optimization using latent semantic analysis", Politeknik Dergisi, 24(2): 473-479, (2021).
To link to this article: http://dergipark.org.tr/politeknik/archive
DOI: 10.2339/politeknik.684377
Keyword Extraction for Search Engine Optimization Using Latent
Semantic Analysis
Highlights
SEO processes were performed automatically.
The study has been tested with a well-known data set.
It has become easier to create content suitable for SEO.
Graphical Abstract
The following figure shows the process of obtaining keywords from textual data used as web content, in stages.
Figure 1
Aim
In this study, the keywords that best represent the content were obtained using LSA in order to ensure that the content on web pages complies with SEO criteria.
Design & Methodology
Latent semantic analysis, a vector-space-based technique, is used in this study.
Originality
Keyword lists used in SEO were examined according to different dimension reduction parameters.
Findings
Better results were obtained with the proposed method based on term and document similarity.
Conclusion
Keyword lists that best represent textual web content were obtained by considering their similarity with other
terms/sentences.
Declaration of Ethical Standards
The author of this article declares that the materials and methods used in this study do not require ethical committee permission and/or legal-special permission.
Keyword Extraction for Search Engine Optimization
Using Latent Semantic Analysis
Research Article
Fahrettin HORASAN*
Engineering Faculty, Computer Engineering Department, Kırıkkale University, Kırıkkale, Turkey
(Received: 04.02.2020; Accepted: 11.04.2020)
ABSTRACT
It is now difficult to access desired information in the Internet world, and search engines constantly try to overcome this difficulty. However, web pages that cannot reach their target audience through search engines cannot become popular. For this reason, search engine optimization (SEO) is performed to increase visibility in search engines. In this process, a few keywords are selected from the textual content added to the web page. Determining these words requires a person who is knowledgeable about both the content and search engine optimization; otherwise, an effective optimization cannot be obtained. In this study, keyword extraction from textual data was performed with the latent semantic analysis (LSA) technique. LSA models the relations between the documents/sentences and the terms in a text using linear algebra. According to the similarity values of the terms in the resulting vector space, the words that best represent the text are listed. This allows people without knowledge of the SEO process or the content to add content that complies with the SEO criteria. Thus, the method both reduces financial expenses and gives web pages the opportunity to reach their target audience.
Keywords: Search engine optimization, keyword extraction, latent semantic analysis, text mining.
1. INTRODUCTION
The increase in internet usage and the ease of publishing websites have led to a rapid increase in the number of web pages published on the internet [1]. Moreover, the most important purpose of publishing a website is to ensure that its data is delivered to the target users on time. Under standard circumstances, it is time-consuming for users to access the information they want on the internet, since doing so would require them to know all the content on the web and where each specific piece of content is located. This issue is largely overcome by websites called web search engines.
According to World Internet Usage and Population Statistics data, 58.8 percent of the world's population used internet technologies as of mid-2019 [2]. Figure 1 shows the increase in the number of hostnames and active websites in the world. Difficulties are encountered in accessing correct information in an internet environment with so many users and websites [3]. Search engines are information navigation tools that enable users to access the information they want on the internet [4]. Moreover, in parallel with the development of web technologies, the importance of search engines increases day by day [5].
Search engines scan the contents of web pages at certain time intervals via software called crawlers or spiders and add the necessary information about the scanned pages to their databases. Thus, they make it possible for internet users to access the desired content or web pages through them [6]. In order for the crawlers to find and scan the web pages, the
URL details of the web sites need to be recorded in the databases of the search engines [6]. For this reason, it is important for web developers to carry out this registration process and to update the site contents regularly so that the websites get more visits [6,7]. The whole set of operations performed to ensure that websites are indexed by search engines in the best possible way is called search engine optimization (SEO) [8,9].
Figure 1. The increase in the number of hostnames and active websites in the world [2]
During the SEO process, some additional edits are made in the required places of a web site. These edits may look like simple operations, but they have a significant impact on the site's search results. They may improve performance, but it should also be taken into account that they may reduce it, so the edits need to be done carefully [6,7]. Web developers need to do separate SEO work for each site, since the content and its presentation differ from site to site. Moreover, the person performing the SEO process must be knowledgeable about the content posted on the web page; otherwise, the SEO process may cause the website to become one that does not appeal to its target audience [7].
SEO operations can be facilitated by designing the necessary processes to run automatically, and there are many studies on this subject. First of all, SEO operations should be carried out for the visitors, not for the website publisher. SEO that directs users to the correct web pages is called white hat SEO [10]. Therefore, the main principle in SEO studies is to be user oriented. There are many studies in the literature that automatically perform the SEO process in a user-oriented way or that facilitate this process. For example, a content management system that performs white hat SEO has been developed for an active web page called Fragfornet [11]. This system, in which content is added and managed on the web page, performs SEO automatically. Electronic marketplace sites are among the platforms attracting attention in the internet world, and one study developed a product-content optimization for such a site; its basis is a multi-criteria optimization model covering factors such as discount time, visual presentation and product relations [12]. In another study, researchers implemented a new heuristic scanning process with additional learning techniques that learn from the data affecting a web site's ranking in the search engine. They proved that a system that automatically combines intuitive scans based on the data coming from users achieves a better rank in search engines [13].
Search engines make recommendations according to the query sentence, which consists of keywords or keyphrases [14]. Studies on keyword extraction [15,16,17], query suggestion [18] and query classification [19] are based on this.
Latent semantic analysis (LSA) can be useful in finding keywords or keyphrases that best represent the semantic structure of a text [20,21]. In this study, the keywords used in the SEO process of the contents added to web pages were determined with the LSA model. LSA, which is used in many fields such as text mining, image processing, data mining, signal processing and voice analysis, is a dimension-reduction-based approach [21,22,23]. Tests performed according to the parameter of the technique called rank-k, used in the dimension-reduction stage, were examined. Thus, a more efficient and faster recommendation process was obtained. The contributions of the developed model can be listed as follows.
SEO processes are performed automatically for each content item added to the web pages.
Someone who has nothing to do with the content or the SEO stages can also add content that complies with the SEO criteria.
The study has been tested with a well-known data set, so it is comparable with future studies.
The keyword lists were examined according to different dimension-reduction parameters.
The study can shed light on studies such as topic extraction and query classification.
In the next section, the realization of the SEO process with latent semantic analysis is explained. In Section 3, the experiments and test parameters are explained. The last section presents the discussion and conclusion.
2. SEO VIA LATENT SEMANTIC ANALYSIS
Latent semantic analysis is a statistical/mathematical technique that reveals the latent relations between term-term, term-document and document-document pairs [22,23]. The LSA, which is a dimension-reduction-based approach, aims to consider only the important data groups in the dataset. Data in the data stack that do not contribute to the meaning, or that affect the meaning negatively, are not included in this process. For this, a low-rank approximation of the term-document matrix is used to find the latent semantic structure between the terms and the documents [23]. Terms and documents are represented by the rows and columns of the matrix, respectively; this matrix is called the term-document matrix. The entry in the i-th row and j-th column of the term-document matrix contains the mathematical value of the i-th term in the j-th document. This value is known as the weight, and its calculation is known as weighting. The weight of each term in each document is calculated according to one of several methods [21,23].
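As a toy illustration of this data structure (the vocabulary and documents below are invented, and raw counts stand in for the weights), a term-document matrix can be built as follows:

```python
# Toy term-document matrix: rows are terms, columns are documents, and the
# entry at (i, j) is the weight of term i in document j (raw counts here).
from collections import Counter

docs = ["web search engine", "search engine optimization", "web content"]
tokenized = [d.split() for d in docs]
terms = sorted({t for doc in tokenized for t in doc})

counts = [Counter(doc) for doc in tokenized]
A = [[counts[j][term] for j in range(len(docs))] for term in terms]

for term, row in zip(terms, A):
    print(f"{term:<13}{row}")
```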
Usually, a textual dataset is processed in the LSA. By passing the dataset through preprocessing, stop words and punctuation marks are removed, and stemming is applied to each term. Then the weights in the term-document matrix are obtained. In this study, the most widely used weighting method, TF-IDF, was chosen [21,22,24]. Thus, the term-document matrix (A) is obtained by using the weight of each term in each document. The SVD is applied to the obtained matrix, and the rank-k approach is applied to the matrices obtained from this decomposition in order to reduce the dimension. After applying the rank-k approach, the term matrix and the document matrix are obtained by multiplying the left and right orthogonal matrices by the singular value matrix, respectively [25,26]. Each row in the term matrix represents the vector of the term with the same index in the term-document matrix, and each column in the document matrix represents the vector of the document with the same index. Thus, the term and document vectors are represented in the same vector space. After the vector space is obtained, the documents/terms are listed according to the query, from most similar to least similar. Ultimately, the documents/terms associated with the query (which can be a document, a term or a sentence) are discovered [23,27,28].
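A minimal sketch of this preprocessing and weighting stage is given below, assuming scikit-learn and NLTK; the concrete stop-word list, stemmer (Porter) and TF-IDF options are illustrative assumptions, not the exact configuration used in the study:

```python
# Preprocessing and TF-IDF weighting sketch (assumed tooling: NLTK stemmer,
# scikit-learn vectorizer); the study's exact settings may differ.
import string
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Lowercase, strip punctuation, and stem each remaining token.
    tokens = (w.strip(string.punctuation) for w in text.lower().split())
    return " ".join(stemmer.stem(w) for w in tokens if w)

sentences = [                      # each sentence is treated as a document
    "TimeWarner posted a profit boost for the full year.",
    "High-speed internet revenue helped AOL.",
    "Catwoman stood in sharp contrast to the final Ring trilogy.",
]

vectorizer = TfidfVectorizer(stop_words="english")   # removes stop words
X = vectorizer.fit_transform(preprocess(s) for s in sentences)
A = X.T.toarray()                  # term-document matrix A (terms x sentences)
print(A.shape, list(vectorizer.get_feature_names_out())[:5])
```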
In this study, a single text is considered as the data stack and each sentence in this text is used as a document. The term-document (sentence) matrix is obtained for the terms in each sentence. As mentioned in the previous paragraph, the SVD is applied to this matrix, and the vectors of the terms and documents are determined by the value of k in the rank-k approach. The terms are listed according to their proximity to each other, taking into account their similarities to all terms and sentences mentioned in the text. In this way, word lists that are highly similar to both the terms and the documents can be obtained. These are the words with the most discriminative power in the text, the ones that best represent it. Figure 2 shows the flow chart of the study.
Figure 2. Flow chart of keyword extraction technique with LSA
2.1. Singular Value Decomposition
The SVD of the matrix $A \in \mathbb{R}^{m \times n}$ is given by the formula

$A = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T$.  (1)

Here $m > n$, $U^T U = U U^T = I_m$ and $V^T V = V V^T = I_n$. In addition, the diagonal matrix $\Sigma$ containing the singular values of $A$ is of the format

$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$, with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_k > \sigma_{k+1} = \cdots = \sigma_n = 0$.  (2)
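Equation (1) corresponds directly to a full SVD call in numpy; the following minimal check (on an invented matrix) verifies the decomposition and the orthogonality conditions:

```python
# Full SVD of a term-document matrix A (m x n, m > n), as in Equation (1).
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                      # toy 6x4 term-document matrix

U, s, Vt = np.linalg.svd(A)                 # U: 6x6, s: singular values, Vt: 4x4
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)        # stacked [Sigma; 0] as in Eq. (1)

assert np.allclose(U @ Sigma @ Vt, A)       # A = U [Sigma; 0] V^T
assert np.allclose(U.T @ U, np.eye(6))      # U^T U = I_m
assert np.allclose(Vt @ Vt.T, np.eye(4))    # V V^T = I_n
print(s)                                    # sigma_1 >= sigma_2 >= ... >= 0
```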
2.2. Rank-k Approach
The rank-k approach is applied to Formula (2) to reduce the computational cost and increase efficiency. When the rank of the matrix $A$ is $k$,

$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_k \geq \tau > \sigma_{k+1} \geq \cdots \geq \sigma_n$.  (3)
Here, $\tau$ represents the threshold value. For $\tau$ to be optimal, the difference between $\sigma_k$ and $\sigma_{k+1}$ should be significantly large. If the rank-k approximation of the matrix $A$, denoted $A_k$, is used instead of $A$ in the LSA, $A_k$ is represented by the equation

$A_k = U_k \Sigma_k V_k^T$.  (4)

In the formula, $U_k$ and $V_k^T$ represent the first $k$ columns of $U$ and the first $k$ rows of $V^T$, respectively, and $\Sigma_k$ is the diagonal matrix $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$.
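One plausible reading of the threshold rule above, sketched purely for illustration, is to place the cut at the largest gap between consecutive singular values; the paper does not prescribe this exact procedure:

```python
# Choose k where the gap sigma_k - sigma_{k+1} is largest, one way to place
# the threshold tau between sigma_k and sigma_{k+1} as described above.
import numpy as np

def choose_k(singular_values: np.ndarray) -> int:
    gaps = singular_values[:-1] - singular_values[1:]
    return int(np.argmax(gaps)) + 1           # +1 because k is 1-based

s = np.array([9.1, 7.8, 7.5, 3.2, 0.9, 0.4])
print(choose_k(s))                            # -> 3 (gap 7.5 - 3.2 is largest)
```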
2.3. Obtaining Vector Space
The representatives of the terms and documents in the vector space, $T_k$ and $D_k$ respectively, are obtained with the equations

$T_k = U_k \Sigma_k$  (5)
$D_k = \Sigma_k V_k^T$  (6)

Herein, the i-th row of the $T_k$ matrix is the vector symbolizing the i-th term, and the j-th column of the $D_k$ matrix is the vector symbolizing the j-th document.
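Equations (4)-(6) translate into a few lines of numpy; this sketch assumes a term-document matrix A built as in the earlier steps:

```python
# Rank-k truncation and the vector spaces of Eqs. (4)-(6):
#   A_k = U_k Sigma_k V_k^T,  T_k = U_k Sigma_k,  D_k = Sigma_k V_k^T
import numpy as np

def lsa_spaces(A: np.ndarray, k: int):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    T_k = U_k @ S_k        # row i: vector of term i
    D_k = S_k @ Vt_k       # column j: vector of document (sentence) j
    return T_k, D_k

rng = np.random.default_rng(1)
A = rng.random((8, 5))     # toy 8-term x 5-sentence matrix
T_k, D_k = lsa_spaces(A, k=2)
print(T_k.shape, D_k.shape)   # (8, 2) and (2, 5)
```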
2.4. Listing Words
At this stage, the words are listed in three ways. In the first, only the other terms are taken into account when calculating the similarity of each term; according to the similarity value, the terms are listed from the lowest value to the highest. This method is Term Similarity Based Listing (TSBL). In the second, the words are listed according to their similarity to the documents; this method is Document Similarity Based Listing (DSBL). The third listing method calculates the similarity to the terms and the documents together; this method is Term and Document Similarity Based Listing (TDSBL).
In each of these methods, the cosine similarity technique was used. It was preferred because it makes an angular similarity measurement: it takes into account the cosine of the angle between two vectors in the same vector space [21,24]. The similarity between two vectors is calculated by the formula

$\mathrm{Cos\_Similarity}(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \, \sqrt{\sum_{i=1}^{m} y_i^2}}$.  (7)

In the formula, $X$ and $Y$ represent $1 \times m$ dimensional vectors.
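The three listing schemes can be sketched on top of Equation (7); scoring each term by its mean cosine similarity to the other term vectors (TSBL), to the document vectors (DSBL), or to both (TDSBL) is an illustrative aggregation, since the paper does not spell out the exact formula:

```python
# Cosine similarity (Eq. 7) and the three listing schemes. Scoring a term by
# its mean similarity to the other vectors is an assumption for illustration.
import numpy as np

def cos_sim(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def list_terms(T_k: np.ndarray, D_k: np.ndarray, method: str = "TDSBL"):
    scores = []
    for i, t in enumerate(T_k):
        term_sims = [cos_sim(t, u) for j, u in enumerate(T_k) if j != i]
        doc_sims = [cos_sim(t, d) for d in D_k.T]   # documents are columns
        if method == "TSBL":                        # term similarity only
            score = np.mean(term_sims)
        elif method == "DSBL":                      # document similarity only
            score = np.mean(doc_sims)
        else:                                       # TDSBL: both together
            score = np.mean(term_sims + doc_sims)
        scores.append((i, score))
    return sorted(scores, key=lambda p: p[1])       # least to most similar

rng = np.random.default_rng(2)
T_k = rng.random((8, 2))    # toy 8 terms in a k=2 space
D_k = rng.random((2, 5))    # toy 5 documents in the same space
print(list_terms(T_k, D_k, "TDSBL")[-3:])   # three highest-scoring terms
```

The best N keywords of Section 2.5 are then read off the most-similar end of the returned list.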
2.5. Choosing Words and Evaluating
This stage is the selection of the best N of the words listed in the previous stage. The N keywords are determined by selecting the best of the terms listed in order from least similar to most similar. The keyword lists obtained were examined according to the TSBL, DSBL and TDSBL techniques. In addition, their performances for different values of k in the rank-k approach were examined. As an example, Table 1 shows the 20 most similar words of a text for different rank-k values according to the TDSBL technique.
3. EXPERIMENTAL ANALYSIS
In this study, the BBC news collection was used as the data set. In total, this collection contains 1313 documents and 15393 words in 5 classes.
The performances of the keyword lists obtained with the TSBL, DSBL and TDSBL techniques were examined according to the rank-k approach. For the word groups listed according to two different $k$ values, the number of similar words ($n$) and the similarity ratio ($sr$) were examined. The $sr$ value is calculated according to the equation

$sr = \frac{\text{number of similar terms}}{N}$.  (8)

Here, $N$ is the number of best terms selected; in Tables 2-4, $sr$ is reported as a percentage with $N = 20$.
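A direct reading of Equation (8), with the percentage scaling used in Tables 2-4, might look as follows; the example word lists are invented:

```python
# Similarity ratio (Eq. 8) between the top-N keyword lists obtained with two
# different rank-k values; reported as a percentage, as in Tables 2-4.
def similarity_ratio(list_a: list[str], list_b: list[str], N: int) -> float:
    n_similar = len(set(list_a[:N]) & set(list_b[:N]))
    return 100.0 * n_similar / N

k20 = ["boost", "profit", "timewarn", "year", "earlier", "internet"]
k15 = ["boost", "profit", "timewarn", "year", "earlier", "aol"]
print(similarity_ratio(k20, k15, N=6))   # 5 shared terms -> 83.3...
```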
The algorithmic complexity of the keyword extraction technique using the $m \times n$ dimensional term-document matrix is $O(mn^2)$. In this study, where the rank-k approach is used with $k < \min(m, n)$, the algorithmic complexity is $O(mnk)$. Thus, a less costly system was developed.
Table 1. Words listed according to different k values

k = 20: boost, profit, timewarn, year, earlier, high, speed, internet, aol, revenu, catwoman, contrast, third, final, lord, ring, trilog, full, chief, execut
k = 15: boost, profit, timewarn, year, earlier, internet, aol, revenu, help, box, offic, alexand, catwoman, sharp, contrast, final, ring, trilog, full, post
Similar words: boost, profit, timewarn, year, earlier, internet, aol, revenu, catwoman, contrast, final, ring, trilog, full

During the analysis process, keyword extraction was performed for all documents in the dataset. Firstly, Figures 3.a, 4.a and 5.a, which show the
similarity changes according to the TSBL, DSBL and TDSBL techniques described in Section 2.4, should be examined. In these figures, the terms are sorted according to their similarity values with respect to all documents; as can be seen, the similarity changes in an increasing manner. Figures 3.b, 4.b and 5.b, which show the performances of the 20 most similar words, present the similarity change of the term groups that can represent the document better.
Figure 3. Similarity changes according to the TSBL technique (a: all of the listed terms; b: top 20 of the listed terms)
Figure 4. Similarity changes according to the DSBL technique (a: all of the listed terms; b: top 20 of the listed terms)
Figure 5. Similarity changes according to the TDSBL technique (a: all of the listed terms; b: top 20 of the listed terms)
In Tables 2, 3 and 4, the similarities of the keywords obtained with the TSBL, DSBL and TDSBL techniques are examined for different rank-k values. In the TSBL technique, the best results were observed when k is 15 and 20. In the DSBL technique, good results are observed when k is in the range 20-25. In the TDSBL technique, which uses both similarities, good results were obtained when k is between 15 and 20.
4. CONCLUSION
Firms and individuals have to work with search engine optimization consultants or companies for their web pages; they need this in order to reach their target audience or to grow their audience in the e-commerce environment. As a result, the SEO process of a website causes both labor and financial expenses. In this study, in order to eliminate or reduce these costs, the keyword determination steps of the SEO process were performed automatically. The latent semantic analysis and keyword extraction method used in the study will shed light on future studies, especially in areas such as question answering, topic detection, and text classification.
DECLARATION OF ETHICAL STANDARDS
The author(s) of this article declare that the materials and
methods used in this study do not require ethical
committee permission and/or legal-special permission.
AUTHORS’ CONTRIBUTIONS
Fahrettin HORASAN: performed the design and
implementation of the research, analysis of the results
and writing the article.
CONFLICT OF INTEREST
There is no conflict of interest in this study.
Table 2. Performances of the TSBL technique

rank k1 | rank k2 | number of similar terms (n) | sr (%)
rank 2 | rank 10 | 1.3 | 6.5
rank 2 | rank 15 | 0.5 | 2.5
rank 2 | rank 20 | 0.3 | 1.5
rank 2 | rank 25 | 0.1 | 0.5
rank 10 | rank 15 | 8.1 | 40.5
rank 10 | rank 20 | 7.0 | 35.0
rank 10 | rank 25 | 8.6 | 43.0
rank 15 | rank 20 | 14.2 | 71.0
rank 15 | rank 25 | 11.0 | 55.0
rank 20 | rank 25 | 13.1 | 65.5
Table 3. Performances of the DSBL technique

rank k1 | rank k2 | number of similar terms (n) | sr (%)
rank 2 | rank 10 | 6.1 | 30.5
rank 2 | rank 15 | 5.4 | 27.0
rank 2 | rank 20 | 3.7 | 18.5
rank 2 | rank 25 | 3.3 | 16.5
rank 10 | rank 15 | 12.1 | 60.5
rank 10 | rank 20 | 12.1 | 60.5
rank 10 | rank 25 | 11.0 | 55.0
rank 15 | rank 20 | 12.2 | 61.0
rank 15 | rank 25 | 10.5 | 52.5
rank 20 | rank 25 | 16.0 | 80.0
Table 4. Performances of the TDSBL technique

rank k1 | rank k2 | number of similar terms (n) | sr (%)
rank 2 | rank 10 | 2.0 | 10.0
rank 2 | rank 15 | 1.1 | 5.5
rank 2 | rank 20 | 0.0 | 0.0
rank 2 | rank 25 | 1.1 | 5.5
rank 10 | rank 15 | 10.2 | 51.0
rank 10 | rank 20 | 7.3 | 36.5
rank 10 | rank 25 | 8.4 | 42.0
rank 15 | rank 20 | 15.2 | 76.0
rank 15 | rank 25 | 14.1 | 70.5
rank 20 | rank 25 | 16.4 | 82.0
REFERENCES
[1] Leavitt N., "Network-usage changes push internet traffic to the edge", Computer, 43(10): 13-15, (2010).
[2] Internet World Stats, "World Internet Usage and Population Statistics: 2019 Mid-Year Estimates", www.internetworldstats.com, (2019).
[3] Wood S., "Web of Deception: Misinformation on the Internet", New Library World, (2003).
[4] Yan L., Gui Z., Du W. and Guo Q., "An improved PageRank method based on genetic algorithm for web search", Procedia Engineering, 15: 2983-2987, (2011).
[5] Cui M. and Hu S., "Search engine optimization research for website promotion", In 2011 International Conference of Information Technology, Computer Engineering and Management Sciences, 4: 100-103, (2011).
[6] Killoran J. B., "How to use search engine optimization techniques to increase website visibility", IEEE Transactions on Professional Communication, 56(1): 50-66, (2013).
[7] Yalçın N. and Köse U., "What is search engine optimization: SEO?", Procedia - Social and Behavioral Sciences, 9: 487-493, (2010).
[8] Malaga R. A., "Worst practices in search engine optimization", Communications of the ACM, 51(12): 147-150, (2008).
[9] Google, "Search Engine Optimization Starter Guide", (2013).
[10] Mittal M. K., Kirar N. and Meena J., "Implementation of search engine optimization: through white hat techniques", In 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), 674-678, (2018).
[11] Gandour A. and Regolini A., "Web site search engine optimization: a case study of Fragfornet", Library Hi Tech News, (2011).
[12] Asllani A. and Lari A., "Using genetic algorithm for dynamic and multiple criteria web-site optimizations", European Journal of Operational Research, 176(3): 1767-1777, (2007).
[13] Boyan J., Freitag D. and Joachims T., "A machine learning architecture for optimizing web search engines", In AAAI Workshop on Internet Based Information Systems, (1996).
[14] Kiritchenko S. and Jiline M., "Keyword optimization in sponsored search via feature selection", In New Challenges for Feature Selection in Data Mining and Knowledge Discovery, 122-134, (2008).
[15] Zimniewicz M., Kurowski K. and Węglarz J., "Scheduling aspects in keyword extraction problem", International Transactions in Operational Research, 25(2): 507-522, (2018).
[16] Joshi A. and Motwani R., "Keyword generation for search engine advertising", In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 490-496, (2006).
[17] Abhishek V. and Hosanagar K., "Keyword generation for search engine advertising using semantic similarity between terms", In Proceedings of the Ninth International Conference on Electronic Commerce, 89-94, (2007).
[18] Sordoni A., Bengio Y., Vahabi H., Lioma C., Grue Simonsen J. and Nie J. Y., "A hierarchical recurrent encoder-decoder for generative context-aware query suggestion", In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 553-562, (2015).
[19] Hong Y., Vaidya J., Lu H. and Liu W. M., "Accurate and efficient query clustering via top ranked search results", Web Intelligence, 14(2): 119-138, IOS Press, (2016).
[20] Süzek T. Ö., "Using latent semantic analysis for automated keyword extraction from large document corpora", Turkish Journal of Electrical Engineering & Computer Sciences, 25(3): 1784-1794, (2017).
[21] Varçın F., Erbay H. and Horasan F., "Latent semantic analysis via truncated ULV decomposition", In 2016 24th Signal Processing and Communication Application Conference (SIU), 1333-1336, IEEE, (2016).
[22] Horasan F., Erbay H., Varçın F. and Deniz E., "Alternate low-rank matrix approximation in latent semantic analysis", Scientific Programming, (2019).
[23] Martin D. I. and Berry M. W., "Mathematical foundations behind latent semantic analysis", Handbook of Latent Semantic Analysis, 35-55, (2007).
[24] Berry M. W. and Fierro R. D., "Low-rank orthogonal decompositions for information retrieval applications", Numerical Linear Algebra with Applications, 3(4): 301-327, (1996).
[25] Duman E. and Erbay H., "Latent semantic analysis approach for automatic classification of web pages contents", Master's Thesis, (2013).
[26] Shima K., Todoriki M. and Suzuki A., "SVM-based feature selection of latent semantic features", Pattern Recognition Letters, 25(9): 1051-1057, (2004).
[27] Uysal A. K. and Gunal S., "Text classification using genetic algorithm oriented latent semantic features", Expert Systems with Applications, 41(13): 5938-5947, (2014).
[28] Jessup E. R. and Martin J. H., "Taking a new look at the latent semantic analysis approach to information retrieval", Computational Information Retrieval, 121-144, (2001).
Article
Current methods to index and retrieve documents from databases usually depend on a lexical match between query terms and keywords extracted from documents in a database. These methods can produce incomplete or irrelevant results due to the use of synonyms and polysemus words. The association of terms with documents (or implicit semantic structure) can be derived using large sparse {\it term-by-document} matrices. In fact, both terms and documents can be matched with user queries using representations in k-space (where 100 ≤ k ≤ 200) derived from k of the largest approximate singular vectors of these term-by-document matrices. This completely automated approach called latent semantic indexing or LSI, uses subspaces spanned by the approximate singular vectors to encode important associative relationships between terms and documents in k-space. Using LSI, two or more documents may be closeto each other in k-space (and hence meaning) yet share no common terms. The focus of this work is to demonstrate the computational advantages of exploiting low-rank orthogonal decompositions such as the ULV (or URV) as opposed to the truncated singular value decomposition (SVD) for the construction of initial and updated rank-k subspaces arising from LSI applications.