
An intrinsic evaluator for embedding methods in malicious URL detection

Springer Nature
International Journal of Information Security
International Journal of Information Security (2025) 24:36
https://doi.org/10.1007/s10207-024-00950-9
REGULAR CONTRIBUTION
An intrinsic evaluator for embedding methods in malicious URL
detection
Qisheng Chen¹ · Kazumasa Omote¹
© The Author(s) 2024, corrected publication 2024
Abstract
Machine learning is now used in many fields. Beyond areas such as image recognition, it is also used to detect malicious activity. In recent years in particular, many studies have applied machine learning to malicious URL detection as a replacement for traditional blacklists. To compare the performance of malicious URL detection methods, researchers typically use the F-score or other detection accuracy metrics. However, it is difficult to evaluate the URL embedding method used in malicious URL detection in this way, because detection accuracy is also affected by the machine learning or deep learning model and by the data sets. An evaluation method for URL embedding methods that is not affected by these other factors is therefore particularly important. In this paper, we propose an intrinsic evaluation method for URL embedding methods that is not affected by the machine learning or deep learning model or by the data sets. In addition, we analyse several URL embedding methods with both intrinsic and extrinsic methods and, based on the results, offer guidance on selecting a suitable embedding method for URLs.
Keywords Malicious URLs detection · URL embedding evaluation · URL · Network security
1 Introduction
With the development of machine learning and deep learning, both are now also used to detect malicious URLs. For a machine learning or deep learning model to process a URL, the URL string must first be converted into a sequence of numbers. As in natural language processing, these methods therefore segment the URL and embed the resulting parts into feature vectors. Chen's research shows that most malicious URL detection methods combine a segmentation method, an embedding method, and a machine learning algorithm, which means that any of the three can affect the performance of a malicious URL detection method.
The preliminary version of this paper was presented at IFIPSEC 2023 [1]. This paper adds more related research on malicious URL detection and uses our new evaluation method to evaluate that related research.
✉ Qisheng Chen
s2230146@u.tsukuba.ac.jp
Kazumasa Omote
omote@risk.tsukuba.ac.jp
¹ University of Tsukuba, Tsukuba, Ibaraki 305-8577, Japan
A key feature of machine-learning-based malicious URL detection is that it can detect malicious URLs efficiently while keeping the false detection rate low. The accuracy of a malicious URL detection method is therefore an important evaluation index, and research on machine-learning-based malicious URL detection focuses on increasing detection accuracy.
As an important part of malicious URL detection, the method that turns URLs into feature vectors, called the URL embedding method, also significantly affects detection performance. However, in related research the only way to evaluate a URL embedding method is the accuracy obtained after training the machine learning model. The accuracy of malicious URL detection depends not only on the detection method but also on the training and test sets. In other words, the accuracy of a malicious URL detection method changes with the test set, so evaluating URL embedding methods only by the detection accuracy on a single test set is not comprehensive.
To solve this problem, evaluation from an aspect other than accuracy becomes particularly important. An evaluation method that focuses on the embedded feature vectors themselves is called an intrinsic evaluation method. Unlike an extrinsic evaluation method, it does not depend on the other parts of the detection pipeline or on the training and test sets. Because the embedding method is the only variable, there is no need to worry about the impact of other variables.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
The main contributions of this paper are as follows:
1. We propose an intrinsic evaluation method for URL embedding methods based on cosine similarity. It can evaluate a URL embedding method without being affected by the machine learning model or the data sets.
2. We evaluate several URL embedding methods with both intrinsic and extrinsic methods, find that traditional extrinsic evaluation has difficulty evaluating URL embedding methods, and demonstrate the usefulness of the intrinsic method.
3. We offer guidance on selecting a suitable embedding method for malicious URL detection according to the evaluation results.
The structure of this paper is as follows. In the Preliminary section, we introduce the important algorithms and URL embedding methods used in our work, and in the Related Work section we introduce related research on malicious URL detection and on evaluation methods. Section 4 explains the structure of the intrinsic evaluation method, and Sect. 5 shows the whole evaluation process, both extrinsic and intrinsic. The Evaluation section contains the experimental data, the experimental results, and their analysis. Finally, we discuss some unsolved problems of URL embedding methods and evaluation methods.
2 Preliminary
In this section, we introduce the F-1 score and cosine similarity, which are used as the indicator of the extrinsic evaluation method and as the evaluation algorithm of the intrinsic evaluation method, respectively. We also introduce the URL embedding methods used in our tests.
2.1 F-1 score
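The F-1 score is the harmonic mean of precision and recall. Precision answers the question of what proportion of positive identifications was actually correct; recall answers the question of what proportion of actual positives was identified correctly. With tp, fp, and fn denoting the numbers of true positives, false positives, and false negatives:

$$\text{Precision} = \frac{tp}{tp + fp} \tag{1}$$

$$\text{Recall} = \frac{tp}{tp + fn} \tag{2}$$

$$F = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$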
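As a small illustration of Eqs. (1) to (3), the following sketch computes precision, recall, and the F-1 score from raw counts. The function name and the example counts are hypothetical and are not taken from the experiments in this paper; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-1) from true/false positive and false negative counts."""
    # Assumed convention: a ratio with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 malicious URLs detected, 10 false alarms, 30 missed.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F-1={f:.3f}")
```

Because the F-1 score is a harmonic mean, it stays closer to the smaller of precision and recall than an arithmetic mean would.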
2.2 Cosine similarity
We use cosine similarity as an indicator of how much information is retained. Specifically, the URLs are embedded as vectors in an inner product space, and the cosine similarity is defined as the cosine of the angle between two vectors, that is, the dot product of the vectors divided by the product of their lengths.

$$S = \frac{v_x \cdot v_y}{\|v_x\| \, \|v_y\|} \tag{4}$$

In Eq. (4), $v_x$ and $v_y$ are two feature vectors and $\|v_x\|$ and $\|v_y\|$ are their L2 norms. The advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates must be considered. Cosine similarity represents the relationship between two Tokens: when it is close to 1, the two Tokens are very similar in the sense of the embedding method; conversely, when it is close to 0, the two Tokens are not similar in the sense of the embedding method.
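The definition in Eq. (4) can be sketched in a few lines. The dense version follows the formula directly; the sparse version is an illustrative assumption about how the "only non-zero coordinates" property might be exploited, with vectors stored as hypothetical index-to-value dictionaries. Returning 0.0 for a zero vector is a convention chosen here, not something the paper specifies.

```python
import math


def cosine_similarity(vx: list[float], vy: list[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(vx, vy))
    norm_x = math.sqrt(sum(a * a for a in vx))
    norm_y = math.sqrt(sum(b * b for b in vy))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_x * norm_y)


def sparse_cosine_similarity(sx: dict[int, float], sy: dict[int, float]) -> float:
    """Same quantity for sparse vectors stored as {index: non-zero value}."""
    # Only indices present in both vectors contribute to the dot product,
    # so we iterate over the smaller dictionary.
    small, large = (sx, sy) if len(sx) <= len(sy) else (sy, sx)
    dot = sum(value * large[index] for index, value in small.items() if index in large)
    norm_x = math.sqrt(sum(v * v for v in sx.values()))
    norm_y = math.sqrt(sum(v * v for v in sy.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```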
2.3 URL embedding methods
A method that turns a URL into feature vectors that can be used for training is called an embedding method. In this section, we introduce several well-known embedding methods, dividing them into context-considering embedding methods and context-agnostic embedding methods.
2.3.1 Context-considering embedding methods
In context-considering embedding methods, the generation of word vectors takes the context in the corpus into account. For example, the CBOW and Skip-gram algorithms in Word2Vec predict a word from its context or predict the context from a word, respectively. Because they consider context when turning words into word vectors, prediction accuracy increases. In this paper, we use Word2Vec [2], FastText [3], and GloVe [4] as the target context-considering URL embedding methods.
2.3.2 Context-agnostic embedding methods
Context-agnostic embedding methods such as One-hot Code and TF-IDF [5] are the basic embedding methods. They turn words into vectors simply, using only the number of occurrences of a word or its order of appearance.
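To make the two context-agnostic methods concrete, here is a minimal sketch under stated assumptions: the vocabulary index is assigned by order of first appearance, and TF-IDF uses the plain term-frequency times log(N/df) weighting, which is only one of several common variants. The function names and the three tokenized example URLs are hypothetical and do not reproduce the configuration used in the experiments.

```python
import math
from collections import Counter


def build_vocab(token_lists: list[list[str]]) -> dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token: str, vocab: dict[str, int]) -> list[int]:
    """Vector with a single 1 at the token's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


def tf_idf(token_lists: list[list[str]]) -> list[dict[str, float]]:
    """Per-URL token weights: term frequency times log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights.append({
            tok: (count / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


# Hypothetical, already segmented URLs.
urls = [["www", "amazon", "com"], ["www", "google", "com"], ["news", "sina", "com", "cn"]]
vocab = build_vocab(urls)
print(one_hot("google", vocab))
print(tf_idf(urls)[0])
```

Neither function looks at which tokens appear next to each other, which is what distinguishes these methods from the context-considering ones in Sect. 2.3.1.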
3 Related work
This section introduces research on embedding evaluators in NLP and on malicious URL detectors that use URL embedding and machine learning. For the latter we refer to the survey by G. Pradeepa and R. Devi, "Review of Malicious URL Detection Using Machine Learning" [6], which summarizes 12 related studies on malicious URL detection using machine learning, including their machine learning models and features, and is very helpful for investigating the relevant research.
3.1 A three-step framework for detecting malicious URLs
Chen [7] proposed a three-step framework to review 14 methods of detecting malicious URLs. The framework divides machine-learning-based malicious URL detection into three parts: Segmentation, Embedding, and Machine learning. Using this framework, they evaluated several machine learning models and context-considering methods, verified the importance of considering context, and found that malicious URL detection accuracy improved by about 6% with context-considering methods. Chen's research uses the F-1 score to evaluate the suitability of each embedding method and detection method for a specific malicious URL detection task. However, once the training set and test set of the task change, the F-1 score also changes, which affects the evaluation results. In this respect, their evaluation of embedding methods is incomplete.
3.2 The extrinsic and intrinsic evaluating method in NLP
Wang [8] categorizes NLP evaluators into two types, intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independently of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. The Tokens produced from a URL in the Segmentation step differ from those in natural language processing, and a Token is not a word in the linguistic sense. Nevertheless, because the process and methods of URL embedding and word embedding are similar, we can refer to the evaluation methods for word embedding.
3.3 The segmentation methods of the malicious URLs detection research
URL2Vec, proposed by Yuan et al. [9] in “URL Modeling with Character Embeddings for Fast and Accurate Phishing Website Detection”, is a typical study that uses machine learning to detect malicious URLs. They divided each URL according to its structure into five parts: protocol, sub-domain name, domain name, domain suffix, and URL path.
Kaneko et al. [10], in “Detecting Malicious Websites by Query Templates”, used the machine learning algorithm DBSCAN to cluster malicious URLs and benign URLs. In the segmentation step, they split URLs at every delimiter. Each part of the split URL is called a Token; we call this the Token segmentation method, and we use it to split URLs in this paper.
URLnet, proposed by Le et al. [11] in “Learning a URL Representation with Deep Learning for Malicious URL Detection”, trained a Convolutional Neural Network model to detect malicious URLs and obtained good results. They proposed two different methods to divide URLs. Char-level-CNN separates the URL into individual characters, which we call the Alphabet segmentation method. Word-level-CNN uses the separators “/”, “.” and “-” to divide the URLs.
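The three segmentation styles described above can be compared on one URL with a short sketch. This is an illustration only: treating every run of non-alphanumeric characters as a delimiter is an assumption about what "all delimiters" means for the Token method, and the function names and the example URL are hypothetical rather than taken from the cited systems.

```python
import re


def token_segmentation(url: str) -> list[str]:
    """Token method: split at every delimiter.

    Assumption: a delimiter is any run of characters that are not ASCII
    letters or digits.
    """
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url) if tok]


def word_segmentation(url: str) -> list[str]:
    """Word-level method: split only at the separators '/', '.' and '-'."""
    return [tok for tok in re.split(r"[/.\-]", url) if tok]


def alphabet_segmentation(url: str) -> list[str]:
    """Alphabet (character-level) method: one element per character."""
    return list(url)


url = "http://www.example.com/news/index.html?id=42"
print(token_segmentation(url))
print(word_segmentation(url))
print(len(alphabet_segmentation(url)), "characters")
```

On this example the Token method separates the query string into `id` and `42`, while the word-level split leaves `html?id=42` as a single element because `?` and `=` are not among its three separators.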
3.4 The URL embedding methods of the malicious URLs detection research
In the embedding step of URL2Vec, each part of the URL is embedded as a feature vector using Skip-gram. eXpose, proposed by Joshua et al. [12] in “A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys”, and URLnet use the one-hot code embedding method. The UE Model proposed by Yan et al. [13] is a new URL embedding method that uses Huffman Code and a Huffman Tree to embed URLs. The phishing URL detection system proposed by Ozgur et al. [14] improves on previous research by using NLP-based features to take more word information into account. The malicious URL detection systems proposed by Cho et al. [15], Ripon et al. [16], Kamel et al. [17], Yogendra et al. [18], Patil et al. [19], Ammara et al. [20], Ferhat et al. [21], and Mohammad et al. [22] are typical detection systems based on Feature Engineering. They use the number of '.' characters, the number of subdomain levels, the length of the URL, and a series of other lexical features to train machine learning and deep learning models.
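The three lexical features named above can be extracted with the standard library, as in the following sketch. It is not the feature set of any cited system: the function name, the example URL, and in particular the way subdomain levels are counted (host labels minus two, which assumes a single-label suffix such as "com") are simplifying assumptions made for illustration.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict[str, int]:
    """Hand-crafted numerical features of a URL string."""
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        # Naive assumption: the registered domain is the last two host labels,
        # so everything before them counts as subdomain levels.
        "subdomain_levels": max(len(labels) - 2, 0),
    }


# Hypothetical URL used only to show the output shape.
print(lexical_features("http://login.secure.example.com/update.php"))
```

Each URL becomes a short fixed-length numeric vector regardless of which Tokens it contains, which is why this approach is grouped as "Numerical" rather than as a word embedding in Sect. 6.4.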
4 Intrinsic evaluation method
Intrinsic evaluation methods focus on the embedding performance of URL embedding methods. They test the quality of a representation independently of specific malicious URL detection tasks and directly measure the relationships among the domains in URLs. In other words, the embedded feature vector contains the relational information of the URL Token, and how accurately this information is retained after the URL string is converted into a numeric vector reflects the performance of the URL embedding method.
4.1 Intrinsic score
From the premise in Sect. 2.2, if two Tokens have similar meanings in the URL and their cosine similarity is close to 1, the two Tokens are well embedded. To evaluate the similarity of a group of Tokens, we calculate the average value shown in Eq. (5): the cosine similarity is computed between every pair of distinct Tokens in the group, and the results are averaged. Because the Tokens in the group are similar to each other, the closer $S_{Similar}$ is to 1, the better they are embedded.

$$S_{Similar} = \frac{1}{n(n-1)} \sum_{v_x \in A,\ v_y \in A,\ v_x \neq v_y} S(v_x, v_y) \tag{5}$$

On the other hand, if two Tokens are not similar in URL meaning and their cosine similarity is close to 0, the two Tokens are also well embedded. The method is similar to Eq. (5), but it needs two groups of Tokens, A and B, where group A is not similar to group B in URL meaning. As shown in Eq. (6), it averages the cosine similarity between the Tokens of group A and the Tokens of group B. Because the Tokens in group A are not similar to the Tokens in group B, the closer $S_{Dissimilar}$ is to 0, the better they are embedded.

$$S_{Dissimilar} = \frac{1}{n(n-1)} \sum_{v_x \in A,\ v_y \in B} S(v_x, v_y) \tag{6}$$
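The two averages in Eqs. (5) and (6) can be sketched as follows. The within-group average runs over all ordered pairs of distinct vectors, which gives exactly n(n-1) terms. For the cross-group average, this sketch divides by the number of pairs actually summed; that normaliser is an assumption made here for a self-consistent mean and differs from the n(n-1) printed in Eq. (6). The toy 2D vectors and all names are hypothetical.

```python
import math
from itertools import permutations, product

Vector = tuple[float, ...]


def cosine(vx: Vector, vy: Vector) -> float:
    """Cosine similarity S(vx, vy) as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(vx, vy))
    norm = math.sqrt(sum(a * a for a in vx)) * math.sqrt(sum(b * b for b in vy))
    return dot / norm if norm else 0.0


def s_similar(group_a: list[Vector]) -> float:
    """Mean cosine similarity over all ordered pairs of distinct Tokens in A."""
    if len(group_a) < 2:
        raise ValueError("need at least two vectors")
    pairs = list(permutations(group_a, 2))  # n * (n - 1) ordered pairs
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)


def s_dissimilar(group_a: list[Vector], group_b: list[Vector]) -> float:
    """Mean cosine similarity over all pairs (x in A, y in B).

    Assumption: normalised by the number of cross pairs, |A| * |B|.
    """
    pairs = list(product(group_a, group_b))
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)


# Toy embeddings: group A points roughly along the first axis, group B along the second.
group_a = [(1.0, 0.0), (1.0, 0.1), (0.9, 0.0)]
group_b = [(0.0, 1.0), (0.1, 1.0)]
print(f"S_Similar={s_similar(group_a):.4f}  S_Dissimilar={s_dissimilar(group_a, group_b):.4f}")
```

In this toy case the first value is close to 1 and the second is close to 0, which is the pattern the text describes as a good embedding.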
More broadly, a good embedding has three characteristics: $S_{Similar}$ is close to 1, $S_{Dissimilar}$ is close to 0, and the difference between $S_{Similar}$ and $S_{Dissimilar}$ is large. We therefore propose the following score to evaluate the performance of a URL embedding method; the larger the score, the better the embedding performance:

$$Score = (100 \cdot S_{Similar} - 100 \cdot S_{Dissimilar})^2 + 100 \cdot S_{Similar} \tag{7}$$

Table 1 Similar token set
Amazon Google Reddit Youtube Facebook
Taobao Yahoo Twitter Microsoft Sina
Weibo Adobe Zoom Xinhuanet Ebay

Table 2 Dissimilar token set
com co net htm html
sports finance news blog www
shtml search exe edu index
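As a worked check of Eq. (7), the sketch below applies the score to the 64D $S_{Similar}$ and $S_{Dissimilar}$ values reported in Table 9. The function and variable names are hypothetical. Under the assumption that Table 7 reports the integer part of the score, the computed values line up with its 64D column.

```python
import math


def intrinsic_score(s_similar: float, s_dissimilar: float) -> float:
    """Eq. (7): squared gap between the scaled similarities plus the scaled S_Similar."""
    gap = 100 * s_similar - 100 * s_dissimilar
    return gap ** 2 + 100 * s_similar


# (S_Similar, S_Dissimilar) for 64D embeddings, copied from Table 9.
table9 = {
    "Word2Vec/Skip-gram": (0.88807, 0.67008),
    "Word2Vec/CBOW": (0.95354, 0.87822),
    "FastText/Skip-gram": (0.89620, 0.70631),
    "FastText/CBOW": (0.98454, 0.95189),
    "GloVe": (0.92190, 0.73583),
    "TF-IDF": (0.81473, 0.78621),
    "One-hot Code": (0.70278, 0.65098),
}

for name, (s_sim, s_dis) in table9.items():
    score = intrinsic_score(s_sim, s_dis)
    print(f"{name:20s} score={score:8.3f} integer part={math.floor(score)}")
```

The squared gap dominates the score: Word2Vec/CBOW has a higher $S_{Similar}$ than Word2Vec/Skip-gram, but its much smaller gap gives it a far lower score.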
4.2 URL token pair
To verify the relationship between two feature vectors, we need pairs of URL Tokens whose relationship is already known. For example, the Tokens “amazon” and “google” usually play the role of the domain name in URLs such as “www.amazon.com” or “www.google.com”, so they should be similar both in the URL and in the vector space. We collected the top 50 domains in AlexaTop and selected 15 of them to form a similar Token set. We calculated the cosine similarity of these Tokens with the embedding methods Word2Vec, FastText, and GloVe to ensure that they are similar not only in the sense of the domain but also in the sense of the embedding. After manual selection and verification with the different embedding methods, we formed the similar Token set shown in Table 1. We also selected 15 Tokens that are dissimilar from the Tokens in the similar Token set, taken from parts of the URL such as the domain suffix, for example “www” or “com”. After selection and verification, we obtained the dissimilar Token set shown in Table 2.
5 Evaluation process
5.1 Process of extrinsic evaluation
The extrinsic evaluation method uses the URL embedding method to produce input features for a downstream task and measures changes in performance metrics specific to that task. We therefore set up a specific malicious URL detection task as the downstream task and used several indicators to evaluate the performance of the malicious URL detection methods.
Fig. 1 Process of extrinsic evaluation
Fig. 2 Process of intrinsic evaluation
Figure 1 shows the outline of the extrinsic evaluation process. The original URL is split by the segmentation method into URL Tokens, and the URL Tokens are then embedded into feature vectors by the embedding method under test. The machine learning model is trained with the feature vectors; after training, the resulting model predicts the labels of the URLs used for testing. To evaluate different URL embedding methods, we varied the method used in the embedding step and the model used in the machine learning step, the latter including Random Forest [23], LightGBM [24], Decision Tree, Logistic Regression, and CNN. The embedding dimension is also treated as a variable.
5.2 Process of intrinsic evaluation
Figure 2 shows the outline of the intrinsic evaluation process. We split the URLs in the corpus at every delimiter into URL Tokens and set up each embedding method using the corpus. As experimental subjects, we used Word2Vec/Skip-gram, Word2Vec/CBOW, FastText/Skip-gram, FastText/CBOW, GloVe, TF-IDF, and One-hot Code. For each embedding method under evaluation, we then calculate $S_{Similar}$ and $S_{Dissimilar}$ with the similar Token set and the dissimilar Token set described in Sect. 4.2.
6 Evaluation
In this section, we show and analyze the results obtained with the evaluation process described in Sect. 5.
6.1 Data set
The extrinsic evaluation method requires a complete malicious URL detection task, so we prepared a URL set for use as a corpus and a labeled set for training and testing. We wrote a crawler to collect 140 thousand URLs from AlexaTop [25], a ranking of the most-used domain names provided by Amazon. We selected 5 thousand malicious URLs with classic URL structure from URLhaus [26], a manually maintained malicious URL database, as the malicious URL set, and we selected 5 thousand benign URLs with classic URL structure from the crawl results. The training set and test set are produced from these malicious and benign URLs by cross-validation with a random seed. In addition, to demonstrate that the extrinsic method changes with the dataset, we used different random seeds to construct different datasets for experimentation. One experiment used the same number of malicious and benign URLs for training and testing, while the other used a dataset with a benign-to-malicious ratio of 1:10.
6.2 Experiment results
We took several URL embedding methods as variables and tested them with the extrinsic and intrinsic evaluation methods. We also compared vectors embedded in 64 and 2 dimensions, because higher-dimensional vectors usually contain more information than lower-dimensional ones and allow the model to train on more features, which can improve prediction accuracy. Tables 3 and 4 show the F-1 scores on both data sets and Tables 5 and 6 show the AUC scores of the extrinsic evaluation described in Sect. 5.1, and Table 7 shows the results of the intrinsic evaluation described in Sect. 5.2. Table 9 shows the 64D $S_{Similar}$ and $S_{Dissimilar}$ of each URL embedding method.

Table 3 Extrinsic score (F-1 score) comparison of URL embedding methods with 50% ratio of benign/malicious

| Embedding method | ML | 64D | 2D |
| --- | --- | --- | --- |
| Word2Vec/Skip-gram | RF | 0.98739 | 0.98810 |
| | LGBM | 0.98966 | 0.98575 |
| | DT | 0.97963 | 0.98137 |
| | LR | 0.98211 | 0.97365 |
| | CNN | 0.99245 | 0.99023 |
| Word2Vec/CBOW | RF | 0.98643 | 0.98896 |
| | LGBM | 0.98818 | 0.98723 |
| | DT | 0.98060 | 0.98317 |
| | LR | 0.97777 | 0.97335 |
| | CNN | 0.98381 | 0.98121 |
| FastText/Skip-gram | RF | 0.98603 | 0.98416 |
| | LGBM | 0.98400 | 0.97894 |
| | DT | 0.96715 | 0.95924 |
| | LR | 0.97832 | 0.96822 |
| | CNN | 0.99490 | 0.98260 |
| FastText/CBOW | RF | 0.98411 | 0.98690 |
| | LGBM | 0.98265 | 0.98506 |
| | DT | 0.96547 | 0.96747 |
| | LR | 0.96009 | 0.95026 |
| | CNN | 0.98566 | 0.98209 |
| GloVe | RF | 0.99473 | 0.99378 |
| | LGBM | 0.99675 | 0.99591 |
| | DT | 0.97900 | 0.96089 |
| | LR | 0.96454 | 0.95239 |
| | CNN | 0.98654 | 0.97124 |
| TF-IDF | RF | 0.97811 | 0.97591 |
| | LGBM | 0.95019 | 0.96047 |
| | DT | 0.91870 | 0.90129 |
| | LR | 0.94152 | 0.93281 |
| | CNN | 0.92974 | 0.90853 |
| One-hot Code | RF | 0.93805 | 0.92917 |
| | LGBM | 0.92492 | 0.92386 |
| | DT | 0.88136 | 0.87542 |
| | LR | 0.90179 | 0.89027 |
| | CNN | 0.90295 | 0.88132 |
Table 4 Extrinsic score (F-1 score) comparison of URL embedding methods with 10% ratio of benign/malicious

| Embedding method | ML | 64D | 2D |
| --- | --- | --- | --- |
| Word2Vec/Skip-gram | RF | 0.99126 | 0.98972 |
| | LGBM | 0.99062 | 0.99289 |
| | DT | 0.97675 | 0.98433 |
| | LR | 0.98799 | 0.98433 |
| | CNN | 0.99351 | 0.98713 |
| Word2Vec/CBOW | RF | 0.99100 | 0.99015 |
| | LGBM | 0.99228 | 0.99203 |
| | DT | 0.98256 | 0.98482 |
| | LR | 0.98460 | 0.98380 |
| | CNN | 0.98611 | 0.98387 |
| FastText/Skip-gram | RF | 0.98939 | 0.98904 |
| | LGBM | 0.98938 | 0.98933 |
| | DT | 0.97246 | 0.96915 |
| | LR | 0.98669 | 0.97858 |
| | CNN | 0.99101 | 0.98988 |
| FastText/CBOW | RF | 0.98690 | 0.98793 |
| | LGBM | 0.98896 | 0.98963 |
| | DT | 0.97764 | 0.97525 |
| | LR | 0.97785 | 0.97593 |
| | CNN | 0.98007 | 0.98105 |
| GloVe | RF | 0.99313 | 0.99422 |
| | LGBM | 0.99481 | 0.99430 |
| | DT | 0.99192 | 0.98981 |
| | LR | 0.99281 | 0.99207 |
| | CNN | 0.99531 | 0.99501 |
| TF-IDF | RF | 0.96198 | 0.96023 |
| | LGBM | 0.94102 | 0.94366 |
| | DT | 0.92699 | 0.92061 |
| | LR | 0.93968 | 0.93298 |
| | CNN | 0.94811 | 0.93177 |
| One-hot Code | RF | 0.92909 | 0.92217 |
| | LGBM | 0.91088 | 0.91271 |
| | DT | 0.88159 | 0.87195 |
| | LR | 0.90333 | 0.90167 |
| | CNN | 0.91800 | 0.90619 |
6.3 Results analysis
As shown in Tables 3, 4, 5, 6 and 7, in both the extrinsic and the intrinsic evaluation results the context-considering embedding methods are better than the context-agnostic embedding methods, which means that considering context is essential not only in NLP but also in URL embedding.
Table 5 Extrinsic score (AUC score) comparison of URL embedding methods with 50% ratio of benign/malicious

| Embedding method | ML | 64D | 2D |
| --- | --- | --- | --- |
| Word2Vec/Skip-gram | RF | 0.99863 | 0.99900 |
| | LGBM | 0.99858 | 0.99825 |
| | DT | 0.97858 | 0.98311 |
| | LR | 0.99647 | 0.99263 |
| | CNN | 0.99877 | 0.99830 |
| Word2Vec/CBOW | RF | 0.99889 | 0.99904 |
| | LGBM | 0.989857 | 0.99833 |
| | DT | 0.98231 | 0.98414 |
| | LR | 0.99420 | 0.99269 |
| | CNN | 0.99712 | 0.99881 |
| FastText/Skip-gram | RF | 0.99878 | 0.99869 |
| | LGBM | 0.99835 | 0.99769 |
| | DT | 0.96595 | 0.95747 |
| | LR | 0.99767 | 0.99404 |
| | CNN | 0.99912 | 0.99880 |
| FastText/CBOW | RF | 0.99847 | 0.99889 |
| | LGBM | 0.99835 | 0.99857 |
| | DT | 0.96445 | 0.96599 |
| | LR | 0.99081 | 0.98670 |
| | CNN | 0.99810 | 0.99766 |
| GloVe | RF | 0.99937 | 0.99916 |
| | LGBM | 0.99941 | 0.99933 |
| | DT | 0.98221 | 0.98077 |
| | LR | 0.99383 | 0.99220 |
| | CNN | 0.99879 | 0.99855 |
| TF-IDF | RF | 0.97101 | 0.96912 |
| | LGBM | 0.95607 | 0.95553 |
| | DT | 0.90017 | 0.90690 |
| | LR | 0.93443 | 0.93884 |
| | CNN | 0.92604 | 0.92003 |
| One-hot Code | RF | 0.93660 | 0.92719 |
| | LGBM | 0.92386 | 0.92942 |
| | DT | 0.88099 | 0.87439 |
| | LR | 0.89972 | 0.89921 |
| | CNN | 0.90188 | 0.89613 |
Table 6 Extrinsic score (AUC score) comparison of URL embedding methods with 10% ratio of benign/malicious

| Embedding method | ML | 64D | 2D |
| --- | --- | --- | --- |
| Word2Vec/Skip-gram | RF | 0.99529 | 0.99592 |
| | LGBM | 0.99462 | 0.99509 |
| | DT | 0.91059 | 0.94403 |
| | LR | 0.99178 | 0.98367 |
| | CNN | 0.99263 | 0.99313 |
| Word2Vec/CBOW | RF | 0.99560 | 0.99576 |
| | LGBM | 0.99459 | 0.99452 |
| | DT | 0.93980 | 0.93291 |
| | LR | 0.98760 | 0.98430 |
| | CNN | 0.99393 | 0.99256 |
| FastText/Skip-gram | RF | 0.99476 | 0.99503 |
| | LGBM | 0.99339 | 0.99436 |
| | DT | 0.91696 | 0.88137 |
| | LR | 0.99103 | 0.98237 |
| | CNN | 0.99591 | 0.99583 |
| FastText/CBOW | RF | 0.99330 | 0.99556 |
| | LGBM | 0.99508 | 0.99371 |
| | DT | 0.89816 | 0.90661 |
| | LR | 0.98366 | 0.98122 |
| | CNN | 0.99585 | 0.99219 |
| GloVe | RF | 0.99337 | 0.99382 |
| | LGBM | 0.99440 | 0.99341 |
| | DT | 0.92760 | 0.91684 |
| | LR | 0.98153 | 0.97864 |
| | CNN | 0.99078 | 0.98875 |
| TF-IDF | RF | 0.96893 | 0.96119 |
| | LGBM | 0.94795 | 0.94752 |
| | DT | 0.90792 | 0.90775 |
| | LR | 0.93167 | 0.93496 |
| | CNN | 0.94680 | 0.93606 |
| One-hot Code | RF | 0.93613 | 0.93690 |
| | LGBM | 0.91190 | 0.91587 |
| | DT | 0.87702 | 0.87622 |
| | LR | 0.90447 | 0.90739 |
| | CNN | 0.91591 | 0.91148 |

Table 7 Intrinsic score comparison of URL embedding methods

| Embedding method | 64D | 2D |
| --- | --- | --- |
| Word2Vec/Skip-gram | 564 | 121 |
| Word2Vec/CBOW | 152 | 156 |
| FastText/Skip-gram | 450 | 148 |
| FastText/CBOW | 109 | 118 |
| GloVe | 438 | 157 |
| TF-IDF | 89 | 152 |
| One-hot Code | 97 | 96 |

6.4 The evaluation of related works
We summarized 14 related studies that use machine learning for malicious URL detection in Sect. 3, and in this section we use the intrinsic and extrinsic evaluation methods to evaluate their detection methods. As shown in Table 8, the related research can be divided into five groups. Because the related studies use many different machine learning algorithms, we use the random forest algorithm uniformly to achieve a unified evaluation. These studies also use different segmentation methods such as Alphabet, Word, and Token; here, we uniformly use the Token segmentation method, for two reasons. First, according to our previous study, the Token segmentation method is a context-considering segmentation method, and its actual detection accuracy is among the highest of all segmentation methods. Second, the Token pairs used in this study for the intrinsic evaluation method are split by the Token segmentation method and can therefore be evaluated by the intrinsic evaluation method.

Table 8 Evaluation of related works

| Research | Embedding method | Machine learning | Intrinsic | Extrinsic |
| --- | --- | --- | --- | --- |
| [11], [12] | One-hot code | CNN | 97 | 0.96631 |
| [9] | Skip-gram | Random forest | 564 | 0.99675 |
| [13] | Huffman code | Random forest | 378 | 0.98651 |
| [14] | NLP based features | Random forest | 109 | 0.97998 |
| [15], [16], [17], [18], [19], [20], [21], [22] | Numerical | Random forest | 83 | 0.95208 |
As shown in Table 8, URLnet [11] and eXpose [12] use One-hot Code, a context-agnostic embedding method, and obtain almost the lowest extrinsic score among the related works, because context-agnostic embedding methods do not include contextual information during embedding, as mentioned in Sect. 2.3.2. The intrinsic score shows the same thing: the embedding performance of the One-hot Code method lags far behind the context-considering embedding methods. Nevertheless, given that these two studies were proposed relatively early, they provided new ideas for the study of other machine-learning-based malicious URL detection methods.
URL2Vec [9] and the UE Model [13] achieved the highest scores among the related works. The Skip-gram embedding method is a well-known context-considering model that embeds Tokens with similar meanings close together in the vector space. During detection, distance in the vector space plays a significant role in predicting URL properties.
The remaining studies use the Numerical embedding method, which is not a word embedding algorithm. They use traditional feature embedding, feeding features such as character length and character counts into machine learning models for detection. This traditional approach is more commonly used in network attack detection, such as DDoS detection. It obtained the lowest score, indicating that it is not well suited to current malicious URL detection.
In addition, research using context-considering embedding methods typically yields higher detection accuracy under the extrinsic evaluation method. However, the extrinsic results generally do not differ significantly, and to make the difference more obvious, studies often reduce the training set or the training frequency. Although this makes the difference in detection accuracy between methods more apparent, the accuracy measured with a reduced training set and training frequency is not the accuracy that the detection method would actually achieve.
As shown in Fig. 3, the results of the intrinsic evaluation method are consistent with those of the extrinsic evaluation method, but the differences between the results are clearer. However, the intrinsic evaluation method only evaluates the performance of the embedding method in the malicious URL detection task; the actual detection performance still needs further verification by the extrinsic evaluation method. A combination of the intrinsic and extrinsic evaluation methods is therefore ideal.
6.5 Intrinsic method solves the disadvantages of extrinsic method
6.5.1 Comparing URL embedding methods more easily
As shown in Tables 3, 4, 5 and 6, even when the dimension of the word vector is reduced from 64 to 2, the extrinsic scores remain very close across the different machine learning models. It is therefore difficult to compare the performance of the URL embedding methods and hard to select the right one. With the help of the intrinsic evaluation method, the embedding situation becomes clearer. In Table 7, for example, the Skip-gram methods from Word2Vec and FastText and the GloVe method show a large difference between the intrinsic scores of 64D and 2D, which indicates that 64D embedding can improve detection performance with Word2Vec/Skip-gram, FastText/Skip-gram, and GloVe. The AUC score is also an important metric for measuring detectors, but Tables 5 and 6 show that, apart from DT, there is little difference among the machine learning models, so the AUC score does not make it possible to compare the embedding methods either.
Fig. 3 Comparison of intrinsic and extrinsic results about related works
6.5.2 Not affected by machine learning models and datasets
As shown in Tables 3 and 4, 64D GloVe with CNN has a lower F-1 score than 64D Word2Vec/Skip-gram with CNN at the 50% ratio of benign/malicious, but a higher F-1 score at the 10% ratio. This illustrates a drawback of the extrinsic evaluation method: it can be influenced by the dataset. It is also difficult to compare the performance of embedding methods such as Word2Vec and FastText, because their relative ranking differs from one machine learning model to another. With the help of the intrinsic evaluation method, as shown in Table 7, we can clearly distinguish the differences between the methods.
7 Discussion
7.1 Problems with existing URL embedding
methods
Even though most URL embedding methods achieve good detection
accuracy in extrinsic tests, the cosine similarities of each
URL embedding method shown in Table 9 indicate that these
methods are not the most suitable for malicious URL detection.
They usually obtain a high S_Similar, but S_Dissimilar is not
low enough, which means that different URL Tokens are not well
distinguished by existing URL embedding methods. The most
notable example is the CBOW method: whether from Word2Vec or
FastText, it obtains the highest S_Similar, but the difference
between S_Similar and S_Dissimilar is small. In general,
Word2Vec is more suitable for malicious URL detection, and the
Skip-gram algorithm is more suitable for URL embedding.

Table 9 S_Similar and S_Dissimilar of URL embedding methods
Method                S_Similar   S_Dissimilar
Word2Vec/Skip-gram    0.88807     0.67008
Word2Vec/CBOW         0.95354     0.87822
FastText/Skip-gram    0.89620     0.70631
FastText/CBOW         0.98454     0.95189
GloVe                 0.92190     0.73583
TF-IDF                0.81473     0.78621
One-hot Code          0.70278     0.65098
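To make the comparison in Table 9 concrete, the following minimal sketch computes the gap between S_Similar and S_Dissimilar for each method. The values are copied from Table 9; treating the plain difference as a separation measure is an assumption made here for illustration and is not necessarily the exact Intrinsic Score formula defined earlier in the paper.

```python
# S_Similar and S_Dissimilar values copied from Table 9.
table9 = {
    "Word2Vec/Skip-gram": (0.88807, 0.67008),
    "Word2Vec/CBOW": (0.95354, 0.87822),
    "FastText/Skip-gram": (0.89620, 0.70631),
    "FastText/CBOW": (0.98454, 0.95189),
    "GloVe": (0.92190, 0.73583),
    "TF-IDF": (0.81473, 0.78621),
    "One-hot Code": (0.70278, 0.65098),
}

# Assumption: use the plain difference as a separation measure.
# A larger gap means similar and dissimilar Token Pairs are
# distinguished more clearly by the embedding.
gaps = {
    method: round(s_similar - s_dissimilar, 5)
    for method, (s_similar, s_dissimilar) in table9.items()
}

# Print the methods from the largest gap to the smallest.
for method, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{method:<20} {gap:.5f}")
```

Under this reading, the Skip-gram variants and GloVe show gaps of roughly 0.19 to 0.22, whereas both CBOW variants stay below 0.08 despite having the highest S_Similar.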
However, the common problem with the existing URL embedding
methods is that they were originally designed for NLP. They
identify similar and related words from the perspective of
natural language, based on the relative position of words in
the corpus. These algorithms are not the most suitable for URL
embedding because the relative position of Tokens in a URL
differs from that in natural language. Besides, the treatment
of polysemous Tokens is also unsatisfactory. For example, when
the Token 'zoom' is used as a domain name, its meaning differs
from that of 'zoom' appearing as part of the path under other
domain names, which makes the cosine similarity of Tokens
related to 'zoom' very poor. In conclusion, URL embedding
methods need to solve the above problems to obtain better
embedding performance.
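The polysemy issue can be illustrated with a small sketch. Static embeddings such as those produced by Word2Vec, FastText or GloVe store one vector per Token string, so a lookup returns the same vector whether the Token appears in the domain or in the path. The three-dimensional vectors and the example Tokens below are hypothetical values chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical static embedding table: one vector per Token string.
static_embedding = {
    "zoom": [0.6, 0.5, 0.2],
    "us": [0.9, 0.1, 0.1],
    "images": [0.1, 0.9, 0.3],
}

def embed(tokens):
    """Look up each Token; the position inside the URL is ignored."""
    return [static_embedding[token] for token in tokens]

domain_use = embed(["zoom", "us"])     # 'zoom' as a domain name
path_use = embed(["images", "zoom"])   # 'zoom' as part of a path

# Both occurrences of 'zoom' map to the identical vector, so the
# embedding cannot separate the two meanings.
zoom_similarity = cosine_similarity(domain_use[0], path_use[1])
print(zoom_similarity)
```

Because both uses share one vector, the single representation has to average over contexts with different meanings, which is one plausible reason for the poor similarities the text reports for 'zoom'.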
7.2 Limitations and weaknesses
Although our Intrinsic method addresses the disadvantages of
Extrinsic evaluation methods, namely that they are affected by
the training set and test set used for machine learning and
that they yield only small differences in detection accuracy,
it still has limitations and weaknesses.
7.2.1 The token pairs
The Intrinsic evaluation method does not need a test set based
on specific downstream tasks, but it requires a corpus for
embedding and a set of Token Pairs. Firstly, regarding the
selection of Token Pairs, as mentioned above, the Token Pairs
used for the Intrinsic Evaluation Method were manually chosen
by us.
Changing a few Token Pairs can greatly alter the final
evaluation results. That is to say, for URL embedding methods
that are relatively similar, such as Word2Vec and FastText,
changing one or two Token Pairs is likely to change the
outcome of the Intrinsic Score comparison, so caution is
required when choosing Token Pairs. However, such a small
change will not affect the significant difference in Intrinsic
Score between Context-considering and Context-agnostic
embedding methods.
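The sensitivity described above can be sketched with a leave-one-out check. The per-pair cosine similarities below are made-up numbers for two hypothetical methods, and the score (mean similarity of similar pairs minus mean similarity of dissimilar pairs) is an illustrative assumption rather than the exact Intrinsic Score defined in the paper.

```python
from statistics import mean

def gap_score(similar, dissimilar):
    """Illustrative score: mean similar minus mean dissimilar similarity."""
    return mean(similar) - mean(dissimilar)

# Hypothetical per-pair cosine similarities for two close methods.
methods = {
    "method_A": {"similar": [0.90, 0.88, 0.60], "dissimilar": [0.70, 0.68, 0.66]},
    "method_B": {"similar": [0.85, 0.84, 0.80], "dissimilar": [0.70, 0.70, 0.70]},
}

def ranking(methods, drop_similar=None):
    """Rank methods by score, optionally leaving out one similar pair."""
    scores = {}
    for name, sims in methods.items():
        similar = [s for i, s in enumerate(sims["similar"]) if i != drop_similar]
        scores[name] = gap_score(similar, sims["dissimilar"])
    return sorted(scores, key=scores.get, reverse=True)

full_ranking = ranking(methods)                     # all Token Pairs
reduced_ranking = ranking(methods, drop_similar=2)  # third similar pair removed
print(full_ranking, reduced_ranking)
```

With these illustrative numbers, removing a single similar pair reverses the order of the two methods, which is the kind of instability that makes the choice of Token Pairs delicate for closely matched methods.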
Besides, as described in Sect. 4.2, Token Pairs also need to
be generated according to the corpus. This means that a Token
Pair set is bound to the current corpus: if the corpus does
not contain one of the tokens in a Token Pair, that token
cannot be embedded. In that case, the composition of the Token
Pairs must be adjusted, and the corresponding corpus will also
need to be adjusted. Adjusting a corpus that contains a large
number of URLs requires reacquiring a large amount of URL
data, which makes the evaluation process more complex.
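A simple coverage check makes this dependency visible before any embedding is trained. The sketch below is only an illustration: the tokenizer (lowercasing and splitting on non-alphanumeric characters), the example URLs and the Token Pairs are all hypothetical and do not reproduce the segmentation method or data used in the paper.

```python
import re

def tokenize_url(url):
    """Hypothetical tokenizer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]

def uncovered_pairs(token_pairs, corpus_urls):
    """Return the Token Pairs with at least one Token missing from the corpus."""
    vocabulary = set()
    for url in corpus_urls:
        vocabulary.update(tokenize_url(url))
    return [
        pair for pair in token_pairs
        if pair[0] not in vocabulary or pair[1] not in vocabulary
    ]

# Hypothetical corpus and Token Pairs.
corpus_urls = [
    "https://zoom.us/download",
    "http://example.com/login/index.html",
]
token_pairs = [("zoom", "download"), ("login", "signin"), ("index", "html")]

missing = uncovered_pairs(token_pairs, corpus_urls)
print(missing)
```

Here the pair containing 'signin' is reported because that Token never occurs in the example corpus, so either the pair or the corpus would have to change before evaluation.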
7.2.2 Evaluation for actual tasks
The Intrinsic Evaluation Method evaluates how well the target
embedding method embeds URLs. It cannot evaluate the
performance of specific methods in a particular task. For
example, in the malicious URL detection task described in this
article, the Intrinsic Evaluation Method cannot evaluate how
well the related detection methods detect malicious URLs. In
conclusion, the Intrinsic Evaluation Method we proposed
assists the Extrinsic Method in evaluating the performance of
specific URL embedding methods; the performance of the
detection system still needs to be judged based on specific
machine learning models.
7.2.3 Establish a standard test collection
The Token Pair set used in this article contains only 30 pairs,
which is not enough in terms of quantity. For comparison, the
WordSim353 collection for word vectors, proposed in 2002,
has 353 pairs. In recent years, most collections
have included thousands of pairs. At present, our research
has demonstrated the feasibility of the Intrinsic Evaluation
Method in evaluating the performance of URL embedding
methods, but the Intrinsic Evaluation Method is not yet a
standardized evaluation method. In the future, we will strive
to improve the Token Pair Collection and its corresponding
Corpus, so that our proposed method can truly be used for
standardized evaluation.
7.2.4 Summary of strengths and weaknesses
We summarize the strengths and weaknesses of our proposed
Intrinsic Evaluation Method in the lists below.
Strengths of the Intrinsic Evaluation Method:
1. As a specialized method for evaluating embedding methods,
it makes URL embedding methods easier to compare.
2. It is not affected by machine learning models and
datasets.
Weaknesses of the Intrinsic Evaluation Method:
1. Changing the Token Pairs may significantly change the
evaluation result and requires changing the corresponding
corpus.
2. It cannot evaluate the performance of specific methods in
a particular task such as malicious URL detection.
3. To become a standardized evaluation method for URL
embedding methods, a Token Pair Test Collection and its
corresponding Corpus must be built.
8 Conclusion
In this paper, we proposed an intrinsic evaluation method for
URL embedding methods that can evaluate them without being
affected by machine learning models and data sets. Besides, we
evaluated several URL embedding methods with both intrinsic
and extrinsic methods and found that the results of
traditional extrinsic evaluation are hard to compare across
URL embedding methods, whereas the results of the intrinsic
evaluation show that it fulfils its role in evaluating URL
embedding methods. Finally, we found that the Word2Vec
embedding method and the Skip-gram algorithm are suitable for
URL embedding according to the results of the evaluation.
In future work, we will focus on improving the Intrinsic
Evaluation Method, including increasing the number of Token
Pairs and addressing how to perform Intrinsic Evaluation
without using a Token segmentation method. Besides, using
larger and more diverse datasets that cover various domains,
languages and collection periods to further verify the
feasibility of our proposed method remains a future challenge.
Supplementary Information The online version contains
supplementary material available at
https://doi.org/10.1007/s10207-024-00950-9.
Author Contributions Chen and Omote reviewed this idea and con-
firmed the original manuscript of the paper. Chen was responsible for
analyzing and evaluating the research and writing the paper, while
Omote supervised the writing of the paper.
Funding This work was supported by JSPS KAKENHI Grant Number
JP22H03588.
Data availability The corpus of URLs was crawled from AlexaTop
(https://www.alexa.com/topsites), which stopped service in
2022, but the last version of the site rank can still be
obtained; it is uploaded as BenignURL.csv. The URLhaus
database (https://urlhaus.abuse.ch/) provides the malicious
URLs, uploaded as MaliciousURL.csv.
Declarations
Conflict of interest Not applicable.
Open Access This article is licensed under a Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License,
which permits any non-commercial use, sharing, distribution and repro-
duction in any medium or format, as long as you give appropriate credit
to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if you modified the licensed mate-
rial. You do not have permission under this licence to share adapted
material derived from this article or parts of it. The images or other
third party material in this article are included in the article’s Creative
Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons
licence and your intended use is not permitted by statutory regula-
tion or exceeds the permitted use, you will need to obtain permission
directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/.
References
1. Chen,Q., Omote,K.: Toward the establishment of evaluating URL
embedding methods using intrinsic evaluator via malicious URLs
detection. In: 38TH International Conference on ICT Systems
Security and Privacy Protection (IFIP SEC), pp. 350–360 (2023)
2. Goldberg,Y., Levy, O.: word2vec explained: deriving Mikolov
et al.’s negative-sampling word-embedding method, CoRR
abs/1402.3722 (2014). http://arxiv.org/abs/1402.3722
3. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H.,
Mikolov, T.: Fasttext. zip: compressing text classification models,
arXiv preprint arXiv:1612.03651 (2016)
4. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors
for word representation. In: Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP)
(2014), pp. 1532–1543
5. Rajaraman, A., Ullman, J.D.: Data mining, Cambridge
University Press (2011), pp. 1–17. https://doi.org/10.1017/
CBO9781139058452.002
6. Pradeepa, G., Devi, R.: Review of malicious URL detection using
machine learning. In: Soft Computing for Security Applications,
pp 97–105, Springer (2022)
7. Chen, Q., Omote, K.: A three-step framework for detecting mali-
cious URLs. In: 2022 International Symposium on Networks,
Computers and Communications (ISNCC), IEEE, pp. 1–6 (2022)
8. Wang, B., Wang, A., Chen, F., Wang, Y., Kuo, C.C.J.: Evaluat-
ing word embedding models: methods and experimental results.
APSIPA Trans Signal Inform Process 8, e19 (2019)
9. Yuan, H., Yang, Z., Chen, X., Li, Y., Liu, W.: URL2Vec:
URL modeling with character embeddings for fast
and accurate phishing website detection. In: 2018
ISPA/IUCC/BDCloud/SocialCom/SustainCom (2018), pp.
265–272. https://doi.org/10.1109/BDCloud.2018.00050
10. Kaneko, S., Yamada, A., Sawaya, Y., Thao, T.P., Kubota, A.,
Omote, K.: Detecting malicious websites by query templates. In:
Simion, E., Géraud-Stewart, R. (eds.) Innovative Security Solu-
tions for Information Technology and Communications. Springer,
New York (2020)
11. Le, H., Pham, Q., Sahoo, D., Hoi, S.C.H.: URLNet: learning a URL
representation with deep learning for malicious URL detection,
CoRR abs/1802.03162 (2018)
12. Saxe, J., Berlin, K.: eXpose: a character-level convolutional neural
network with embeddings for detecting malicious URLs, file paths
and registryb keys, arXiv preprint arXiv:1702.08568 (2017)
13. Yan, X., Xu, Y., Cui, B., Zhang, S., Guo, T., Li, C.: Learning
URL embedding for malicious website detection. IEEE Trans. Ind.
Inform. 16(10), 6673 (2020)
14. Sahingoz, O.K., Buber, E., Demir, O., Diri, B.: Machine learning
based phishing detection from URLs. Expert Syst. Appl. 117, 345
(2019)
15. Do Xuan, C., Nguyen, H.D., Nikolaevich, T.V., et al.: Malicious
URL detection based on machine learning. Int. J. Adv. Comput.
Sci. Appl. 11(1), 1 (2020)
16. Patgiri, R., Katari, H., Kumar, R., Sharma, D.: Empirical study on
malicious URL detection using machine learning. In: International
Conference on Distributed Computing and Internet Technology,
pp. 380–388, Springer (2019)
17. Djaballah, K.A., Boukhalfa, K., Ghalem, Z., Boukerma, O.: A
new approach for the detection and analysis of phishing in social
networks: the case of Twitter. In: 2020 Seventh International Con-
ference on Social Networks Analysis, Management and Security,
IEEE, pp. 1–8 (2020)
18. Kumar,Y., Subba,B.: A lightweight machine learning based secu-
rity framework for detecting phishing attacks. In: 2021 Interna-
tional Conference on Communication Systems & Networks, IEEE,
pp. 184–188 (2021)
19. Patil, D., Patil, J.: Feature-based malicious URL and attack type
detection using multi-class classification. ISC Int. J. Inform. Secur.
10(2), 141 (2018)
20. Zamir, A., Khan, H.U., Iqbal, T., Yousaf, N., Aslam, F., Anjum, A.,
Hamdani, M.: Phishing web site detection using diverse machine
learning algorithms, The Electronic Library (2020)
21. Catak, F., Ozgur, K., Sahinbas, Dörtkarde¸s, V.: Malicious URL
detection using machine learning. In: Artificial Intelligence
Paradigms for Smart Cyber-Physical Systems, IGI Global, pp. 160–
180 (2021)
22. Alam, M.N., Sarma, D., Lima, F.F., Saha, I., Hossain, S., et al.:
Phishing attacks detection using machine learning approach. In:
2020 Third International Conference on Smart Systems and Inven-
tive Technology, IEEE, pp. 1173–1179 (2020)
23. Ho, T.K.: Random decision forests. In: Proceedings of 3rd Interna-
tional Conference on Document Analysis and Recognition, vol. 1
(1995), vol. 1, pp. 278–282. https://doi.org/10.1109/ICDAR.1995.
598994
24. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q.,
Liu, T.Y.: Lightgbm: a highly efficient gradient boosting decision
tree. Adv. Neural Inform. Process. Syst. 30, 3146 (2017)
25. Top sites—alexa. https://www.alexa.com/topsites
26. Urlhaus | malware URL exchange. https://urlhaus.abuse.ch/