GlyphNet: Homoglyph domains dataset and detection using attention-based Convolutional Neural Networks

Akshat Gupta1, Laxman Singh Tomar*2, Ridhima Garg*3
1University of Stuttgart
2Robofied
3Friedrich Alexander Universität
st180429@stud.uni-stuttgart.de, laxman.tomar@robofied.com, ridhima.garg@fau.de
Abstract
Cyber attacks deceive machines into believing something that does not exist in the first place. However, there are some to which even humans fall prey. One such well-known attack, which attackers have used over the years to exploit the limits of human vision, is the homoglyph attack. It employs a simple yet effective mechanism to create illegitimate domains that are hard to differentiate from legitimate ones. Moreover, because the difference is nearly impossible for a user to notice, users often end up clicking on these homoglyph domain names. In many cases, this results in either information theft or a malware attack on their systems. Existing approaches use simple, string-based comparison techniques borrowed from basic language-based tasks. Although they are effective to some extent, they usually fail because they are not robust to different types of homoglyphs and are computationally expensive, since their running time is proportional to the string's length. Similarly, neural network-based approaches are employed to tell real domain strings from fake ones. Nevertheless, the problem with both methods is that they require paired sequences of real and fake domain strings to work with, which is often not the case in the real world, as the attacker only sends the illegitimate or homoglyph domain to the vulnerable user. Therefore, existing approaches are not suitable for practical scenarios in the real world. In our work, we created GlyphNet, an image dataset that contains 4M domains, both real and homoglyphs. Additionally, we introduce a baseline method for a homoglyph attack detection system using an attention-based Convolutional Neural Network. We show that our model can reach state-of-the-art accuracy in detecting homoglyph attacks with a 0.93 AUC on our dataset.
Keywords: Homoglyph Attacks, Convolutional Neural
Networks, Cyber Security, Phishing
Introduction
In cyber security, attackers employ different attacks to infiltrate our systems and networks, with objectives varying from stealing crucial information to inflicting system damage. One such deceptive attack is the homoglyph attack (Woodbridge et al. 2018), in which an attacker tries to fool humans and computer systems by using characters and symbols that appear visually similar to characters used in real domain and process names but are different. For example, a typical homoglyph attack may involve changing “d” to “cl”, “o” to “θ”, and “l” to “1”.
*These authors contributed equally.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: Example of a real domain and its homoglyphs
Some of the above substitutions can be difficult for the naked eye to detect, as shown in Figure 1. Users are therefore susceptible to clicking on homoglyph links, more so when navigating from one website to another. The problems arising from such an attack are of two types: a) deceiving humans into believing that an illegitimate domain name is real, so that they use fake webpages as if they were the real ones; b) creating fake academic documents and papers by replacing real strings with homoglyphs to deceive plagiarism detection tools such as Grammarly.com.
Both types of problems are hard to detect and hence require robust methods to identify an attack before it causes any information breach. Previous approaches mainly used comparative algorithms such as edit distance to distinguish homoglyph attacks from legitimate strings (Damerau 1964). Any domain name string that returned an edit distance beyond an acceptable threshold was considered a homoglyph. Edit distance covers only simple operations such as insertion, deletion, transposition, swapping, and substitution. Because of this shortcoming, a slightly altered illegitimate domain name can pass the check as easily as a real one. A slightly better version, called Visual Edit Distance (Ristad and Yianilos 1998), was later proposed. It defines a dedicated edit distance for the visual similarity of the two domain name strings. However, these methods were more relevant in academia and had negligible prevalence in the real world. A homoglyph attack differs from a phishing attack because domain names in the former are hardly distinguishable from the real ones, whereas the differences can be apparent in the latter.
arXiv:2306.10392v1 [cs.CR] 17 Jun 2023
Figure 2: Real text on Grammarly plagiarism detector
Figure 3: Homoglyph text on Grammarly plagiarism detector
We use the famous poem “The Road Not Taken” by Robert Frost to demonstrate this concept. In Figure 2, we ran Grammarly's Plagiarism Detector tool on the poem text. It reports 100% plagiarism, which is correct. However, when we later passed the homoglyphed version of the same text, it reported the text to be 100% unique, as shown in Figure 3. This suggests that even today's state-of-the-art systems cannot effectively deal with texts comprising homoglyphs.
Recently, Microsoft obtained a court order to remove many fraudulent “homoglyph” domains used to conduct fraud and pose as Office 365 users (Page 2021). Following a customer complaint about a business email compromise attack, Microsoft conducted an investigation and discovered that the unidentified criminal organization responsible for the attack had also created 17 other malicious domains, which were combined with stolen customer credentials to gain unauthorized access to Office 365 accounts and monitor the customers' contacts.
Microsoft stated that the cybercriminals have caused and
continue to cause irreparable injury to Microsoft, its cus-
tomers, and the general public. The complaint also stated
that the cybercriminals have illegally accessed customer ac-
counts, monitored customer email traffic, gathered informa-
tion on pending financial transactions, and criminally imper-
sonated [Office 365] customers.
According to studies, this attack hit 71% of organizations in 2021. People in sixty-two countries were the target of a massive cyberattack last year.
In this research, we aim to create a data set that can help
expand research on homoglyph attacks. We propose to apply
an attention-based Convolutional Neural Network (CNN) to
detect homoglyphs without the need to provide paired data.
Additionally, our model achieves a significant performance
boost compared to other approaches due to its architectural
design. Our method can be applied directly as an API or
web service for an organization to check a domain or process
name before accessing it. For evaluation, we compared our
performance with other baselines and found that our model
outperforms them. Moreover, our approach also addresses
the problem of unpaired data setting, which is often the case
in the real world.
The major contributions of our research are as follows:
1. We created a benchmark dataset of 4 million real and homoglyph domain images based on known homoglyph patterns and URLs. The homoglyph strings are generated by adding single- and dual-character noise sampled using a Gaussian distribution over a homoglyph character pool.
2. We propose a method for detecting homoglyphs from the image dataset using an attention-based convolutional neural network trained in a supervised fashion, which achieves a better AUC score than the existing baselines.
The paper starts by introducing the problems faced by existing approaches to detecting homoglyph-based phishing attacks in both academia and the real world. In Related Work, we discuss existing approaches that address this problem with either string matching or Deep Learning based methods like Siamese Neural Networks and GANs, and we explain their major pitfalls in terms of generalization and feasibility. The Dataset section provides a comprehensive description of how the proposed image dataset is generated. It is followed by a brief description of our attention-based CNN baseline implementation. The Experimentation section describes dataset splitting, the metrics used, and other settings. In the Results section, we examine the results and scores obtained from those experiments; both data and baseline implementation results are summarized in a table within the same section. The following section, Discussion, presents experiments we tried that did not work. Finally, the Conclusion section summarizes the observations and contributions.
Related Work
The work by (Ginsberg and Yu 2018) used a Siamese Neural Network to detect homoglyphs using a paired dataset. This dataset included pairs of strings; one was a real domain name, and the other was a homoglyph domain name. In their work, they converted this pair of strings into binary images that were later fed to a Siamese Neural Network (Koch et al. 2015). The Siamese neural network uses two identical convolutional neural networks (LeCun, Bengio et al. 1995) to learn the visual difference between a pair of images. Such networks have been applied to domains such as healthcare and finance, and have recently gained popularity in cyber security.
Although their work showed significant improvement over previous baselines, it suffered from two major pitfalls:
1. In online security systems, it is impossible to provide paired data, without which these systems will not work.
2. It cannot be used in academia due to the inability to find a paired word for each word present in a scientific article.
Therefore, although this approach performs well, it cannot be employed in real-world systems.
The traditional solutions to prevent homoglyph attacks were inspired by genomics (Lu, Mohammed, and Wang 2019), which proposed the idea that, since homoglyph domains are strings, they should be compared with legitimate ones to detect whether they are real or not. Edit Distance (Ristad and Yianilos 1998) is the minimum number of operations required to transform one sequence (a domain or process name string in our case) into another. If the value exceeds an acceptable threshold, the string is predicted to be a homoglyph. This looks effective at first but does not hold up on closer inspection. The reason is that in cases like “google.com” and “go0gle.com”, edit distance returns only 1, which does not look threatening even though the second string is a homoglyph domain name. Furthermore, a paired sequence of strings is required to make comparisons, which would not be available for a homoglyph of a new domain name. Finally, this approach performed poorly in the real world.
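To make this weakness concrete, the following is a minimal sketch of the classic Levenshtein edit distance. It covers insertions, deletions, and substitutions only (transpositions are left out for brevity), and the function name `edit_distance` is ours for illustration; it is not taken from any of the cited works.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("google.com", "go0gle.com"))  # 1: a single substitution
```

A one-character homoglyph scores the same as a harmless one-letter typo, so no threshold separates the two cases. The function also needs the legitimate string as its second argument, which is exactly the paired input that is unavailable in practice.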
Phishing attacks (Hong 2012) should not be confused with homoglyph attacks. Phishing is an attack in which the attacker sends homoglyph, false, or fake links that appear to come from a trusted vendor. It leads to information compromise (Helms, Ettkin, and Morris 2000), data breaches (Cheng, Liu, and Yao 2017), and financial fraud (Reurink 2018). The difference between phishing and homoglyph attacks is that the former uses tricks such as longer, bad, and misspelled domain names and URLs (Ma et al. 2009) to fool people, whereas the latter takes advantage of people's inability to distinguish visually similar but different domain name strings. Thus, better solutions for homoglyph detection are required.
Siamese Neural Networks
The Siamese neural network architecture was proposed to detect homoglyphs using a paired data set. This dataset included pairs of strings: one was a real domain name, and the other was a homoglyph domain name. Each instance was a tuple that contained a real domain string, a homoglyph domain string, and a score denoting whether the second element is a valid homoglyph of the first. In their work, they converted each pair of strings into binary images that were later fed to a Siamese Neural Network. However, we observed a significant difference while reproducing the results on our dataset.
Figure 4: Siamese neural network architecture (Woodbridge et al. 2018)
Figure 5: PhishGAN architecture (Sern, David, and Hao 2020)
PhishGANs
Approaches such as Siamese Neural Networks suffered severely in terms of performance due to lack of data, as they only had close to 91k real domain images. As a remedy, we needed to produce comprehensive data to train our models well. Recently, Lee Joon Sern et al. proposed PhishGANs to generate synthetic data (Sern, David, and Hao 2020). They built a generative adversarial network (Goodfellow et al. 2014) that aimed to create images similar to real domain names in order to enlarge existing data sets. PhishGANs (Sern, David, and Hao 2020), being GANs (Goodfellow et al. 2014), involve a generator and a discriminator, both trained in an adversarial fashion such that the generator learns to produce images similar to those of a real domain, which the discriminator cannot tell apart. Later, these images were fed to a different network for binary classification aimed at distinguishing real domain names from homoglyphs. They used the UNet (Ronneberger, Fischer, and Brox 2015) architecture as the generator, with a custom loss function called dot product loss. The PhishGANs (Sern, David, and Hao 2020) were trained similarly to how Pix2Pix (Isola et al. 2017) is trained. For classification, a new architecture called the homoglyph identifier (HI) was defined, using a CNN (LeCun, Bengio et al. 1995) as an encoder trained with a triplet loss function (Hoffer and Ailon 2015) that takes as input the positive domain (google.com), the anchor domain (go0gle.com), and the negative domain (apple.com). On some popular domains, such as youtube.com and facebook.com, HI achieved an accuracy of roughly 0.81 when tested on their homoglyphs. On an unseen domain, HI achieved an accuracy of 0.76; feeding that domain back into PhishGANs (Sern, David, and Hao 2020), generating its homoglyphs, and then training on them helped detect its homoglyphs with 0.86 accuracy.
Although the idea of generating synthetic data using GANs (Goodfellow et al. 2014) looks promising and intriguing, it is not compelling for real-world usage. GANs (Goodfellow et al. 2014) have been among the trickiest architectures in Deep Learning (LeCun, Bengio, and Hinton 2015) since their advent and are often found to have issues while training in the real world, which is not the case in the constrained environment of academia. It is common to encounter issues such as failure to converge, the generator oscillating between generating specific examples in the domain, and multiple inputs resulting in the same output. Also, the performance increase was not drastic enough to compel us to use it.
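As an illustration of how a triplet objective relates the three inputs, the sketch below uses the common margin-based formulation of triplet loss on Euclidean distances. This is an assumption on our part: the exact loss used by the homoglyph identifier may differ, and the two-dimensional embeddings are made-up values rather than encoder outputs.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive than to the
    negative by at least `margin`; positive otherwise."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Hypothetical embeddings, for illustration only.
anchor = [0.9, 0.1]     # go0gle.com (homoglyph of the positive)
positive = [1.0, 0.0]   # google.com
negative = [-1.0, 0.5]  # apple.com

print(triplet_margin_loss(anchor, positive, negative))  # 0.0
```

With these values the anchor already lies much closer to the positive than to the negative, so the loss is zero. Swapping the positive and negative roles produces a large loss, which is the signal that pulls a homoglyph's embedding towards the domain it imitates.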
Dataset
The work by (Woodbridge et al. 2018) proposed a custom paired data set that comprises 91k real domains and 900k homoglyphs. Each real domain is used to generate its respective homoglyphs. Each point in this dataset is a three-element tuple denoting domain, homoglyph, and score. If the value of the score is 1.0, the homoglyph is a valid homoglyph of the real domain. A key limitation for real-world work on homoglyph-based attacks is the lack of publicly available data sets.
Proposed dataset: GlyphNet
We propose a dataset consisting of real and homoglyph domains. To generate homoglyph domains, real domains are needed. We obtained domains from the Domains Project (Turkynewych 2020). This repository is one of the largest collections of publicly available active domains. The entire repository comprises 500M domains, and we restricted our work to 2M domains due to hardware constraints.
Homoglyph Creation Algorithm
Homoglyph generation is an important task, as one needs to ensure enough randomness to make the result appear real while keeping the process simple enough to fool the target. Publicly available tools like dnstwist (Ulikowski 2015) replace every character in the real input domain with their respective glyphs. This generates poor homoglyphs for the most part because it relies on paired data, which is not fit to serve the purpose in practice. We created a novel algorithm for generating homoglyph domains to ensure that realistic homoglyphs are generated with both randomness and closeness to the real domain. To achieve this, we sample homoglyph noise characters using Gaussian sampling (Boor, Overmars, and Van Der Stappen 1999) from the glyph pool. We used 1M real domains to generate 2M homoglyphs with a single glyph character. To introduce diversity in our dataset, we reran this algorithm on the remaining 1M real domains to generate homoglyph domains with two glyph characters. Finally, we have the 4M real and homoglyph domains.
Figure 6: Rendered images from the dataset; 0: homoglyph domain, 1: real domain
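The generation step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the released implementation: the glyph pool `GLYPH_POOL` is a tiny hypothetical sample, the helper names are ours, and we assume the Gaussian draw selects an index into the candidate glyphs of the character being replaced.

```python
import random

# Hypothetical, tiny glyph pool for illustration; a real pool is far larger.
GLYPH_POOL = {
    "a": ["\u0430", "\u03b1", "\u00e0"],       # Cyrillic a, Greek alpha, a-grave
    "e": ["\u0435", "\u00e9", "\u00ea"],       # Cyrillic ie, e-acute, e-circumflex
    "o": ["0", "\u043e", "\u03bf", "\u03b8"],  # zero, Cyrillic o, omicron, theta
    "l": ["1", "\u0142", "I"],                 # one, l-stroke, capital I
    "i": ["1", "\u0456", "\u00ef"],            # one, Cyrillic i, i-diaeresis
}


def gaussian_index(n, rng):
    """Draw an index in [0, n) from a Gaussian centred on the pool middle."""
    idx = round(rng.gauss((n - 1) / 2, n / 4))
    return min(max(idx, 0), n - 1)  # clamp draws that fall outside the pool


def make_homoglyph(domain, n_glyphs=1, rng=None):
    """Replace n_glyphs characters of the name part with sampled glyphs.
    Returns None when the name has too few replaceable characters."""
    rng = rng or random.Random()
    name, dot, tld = domain.partition(".")
    positions = [i for i, ch in enumerate(name) if ch in GLYPH_POOL]
    if len(positions) < n_glyphs:
        return None
    chars = list(name)
    for pos in rng.sample(positions, n_glyphs):
        candidates = GLYPH_POOL[chars[pos]]
        chars[pos] = candidates[gaussian_index(len(candidates), rng)]
    return "".join(chars) + dot + tld


rng = random.Random(0)
print(make_homoglyph("google.com", n_glyphs=1, rng=rng))  # single-glyph noise
print(make_homoglyph("google.com", n_glyphs=2, rng=rng))  # dual-glyph noise
```

Only the name part before the first dot is perturbed in this sketch, so the top-level domain stays intact. Calling it with `n_glyphs=1` and `n_glyphs=2` corresponds to the single- and dual-character variants described above.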
Image Generation
Homoglyph attacks exploit the weakness of human vision
to differentiate real from homoglyph domain names. From
a visual perspective, we are interested in learning the visual
characteristics of real and homoglyph domain names. To do
so, we rendered images from the real and homoglyph strings
generated via our algorithm. We used the Arial typeface at font size 28, with white text on a black background starting from the middle left of the image; the image size is 150 × 150.
Methodology
This section presents our approach to building an end-to-end homoglyph detection system. We build on an attention-based (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017) convolutional neural network (LeCun, Bengio et al. 1995) that aims to exploit the visual dissimilarity between real and homoglyph domain names. The architecture of our model is shown in Figure 7 and Figure 8.
The rendered images are used as input to the CNN to learn the desired visual feature information. The model consists of four Conv2D layers that learn visual information such as edges, curves, and strokes. Each convolutional layer is paired with a max-pooling layer to perform dimensionality reduction on the learned features. Each convolution block is followed by a convolutional block attention module (CBAM), as described below. The model is developed in Keras (Chollet et al. 2015).
Dataset Name                                           Real     Homoglyph   Total
Domain and Process Strings (Woodbridge et al. 2018)    90k      900k        990k
Similar and Dissimilar Pairs (Majumder et al. 2020)    2257     2257        4514
GlyphNET (Ours)                                        2000k    2000k       4000k
Table 1: Dataset comparison
Figure 7: Our neural network architecture
Figure 8: Zoomed-in view of the conv-attention module
Attention mechanisms boost representational power by focusing on essential features and suppressing unneeded ones. We use CBAM, an efficient attention module designed for feed-forward convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along the channel and spatial dimensions. It then multiplies the attention maps by the feature map to achieve adaptive feature refinement. The overall attention process is summarized as follows:
$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$
1. The input is an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map $F$, respectively.
2. CBAM sequentially infers a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$
3. and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$.
4. $\otimes$ denotes element-wise multiplication.
For the sake of non-linearity, the ReLU activation function is used.
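The two attention steps can be sketched without any deep learning framework. The code below is a dependency-free illustration, not our Keras implementation: it assumes the usual CBAM design (a shared two-layer MLP over average- and max-pooled channel descriptors, followed by a k × k convolution over channel-wise average and max maps), and the tensor sizes, reduction ratio `r`, kernel size `k`, and random weights are arbitrary stand-ins.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def channel_attention(F, W0, W1):
    """M_c (one weight per channel): shared MLP over pooled descriptors."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in F]
    mx = [max(map(max, ch)) for ch in F]

    def mlp(v):
        hidden = [max(0.0, h) for h in matvec(W0, v)]  # ReLU
        return matvec(W1, hidden)

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, kernel):
    """M_s (one weight per pixel): k x k conv over channel avg and max maps."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(kernel[0])
    pad = k // 2
    avg = [[sum(F[c][h][w] for c in range(C)) / C for w in range(W)]
           for h in range(H)]
    mx = [[max(F[c][h][w] for c in range(C)) for w in range(W)]
          for h in range(H)]
    maps = [avg, mx]
    Ms = [[0.0] * W for _ in range(H)]
    for h in range(H):
        for w in range(W):
            acc = 0.0
            for m in range(2):
                for dh in range(k):
                    for dw in range(k):
                        y, x = h + dh - pad, w + dw - pad
                        if 0 <= y < H and 0 <= x < W:  # zero padding
                            acc += kernel[m][dh][dw] * maps[m][y][x]
            Ms[h][w] = sigmoid(acc)
    return Ms


def cbam(F, W0, W1, kernel):
    Mc = channel_attention(F, W0, W1)
    # F' = M_c(F) (element-wise product) F, broadcast over H and W
    F1 = [[[Mc[c] * v for v in row] for row in F[c]] for c in range(len(F))]
    Ms = spatial_attention(F1, kernel)
    # F'' = M_s(F') (element-wise product) F', broadcast over channels
    return [[[Ms[h][w] * F1[c][h][w] for w in range(len(Ms[0]))]
             for h in range(len(Ms))] for c in range(len(F))]


rng = random.Random(0)
C, H, W, r, k = 4, 5, 5, 2, 3  # toy sizes chosen for illustration
F = [[[rng.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
W0 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
W1 = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
kernel = [[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)] for _ in range(2)]
refined = cbam(F, W0, W1, kernel)
print(len(refined), len(refined[0]), len(refined[0][0]))  # 4 5 5
```

Both attention maps pass through a sigmoid, so every weight lies strictly between 0 and 1: the module can only rescale features, and the refined map keeps the shape of its input.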
Experimentation
Dataset and Metrics
We have split our dataset into three parts, train, validation, and test, with a ratio of 70 : 20 : 10, respectively, which amounts to 2.8M, 0.8M, and 0.4M images in the train, validation, and test sets. Each image size is 150 × 150.
Architecture                               Accuracy   Precision   Recall   F1-score   AUC
Siamese CNN (Woodbridge et al. 2018)       0.79       0.78        0.71     0.74       0.78
Ensemble CNN (Majumder et al. 2020)        0.83       0.82        0.79     0.80       0.83
PhishGAN (Sern, David, and Hao 2020)       0.71       0.74        0.65     0.69       0.71
Attention CNN (Ours)                       0.93       0.93        0.93     0.93       0.93
Table 2: Model performance comparison on our dataset
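The split sizes follow directly from the 70 : 20 : 10 ratio, as the short sketch below shows. The helper names and the fixed shuffle seed are illustrative assumptions; the text does not say how the items were shuffled or assigned.

```python
import random


def split_sizes(n, ratios=(70, 20, 10)):
    """Sizes of the train/validation/test parts for n items."""
    total = sum(ratios)
    sizes = [n * r // total for r in ratios]
    sizes[0] += n - sum(sizes)  # any rounding remainder goes to train
    return sizes


def split_items(items, ratios=(70, 20, 10), seed=0):
    """Shuffle deterministically, then cut into three contiguous parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, n_val, _ = split_sizes(len(items), ratios)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


print(split_sizes(4_000_000))  # [2800000, 800000, 400000]
```

Shuffling before cutting matters for a dataset built in stages, because otherwise one split could end up holding mostly one kind of sample.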
We use accuracy for measuring the performance of the
classification task. Since accuracy can sometimes be mis-
leading in a binary classification task, especially for unbal-
anced data sets, we consider precision, recall, and F1 score
as our evaluation metrics, even though our dataset is bal-
anced. We have also used the AUC score to compare our
solution with some other works.
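For reference, these metrics can be computed from labels and model scores as in the minimal sketch below. The toy labels and scores are made up, the 0.5 decision threshold is an assumption, and label 1 is treated as the positive class purely for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (quadratic time, fine for a demo)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example, not taken from our experiments.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(binary_metrics(y_true, y_pred))  # all four metrics equal 0.75 here
print(roc_auc(y_true, scores))         # 0.9375
```

AUC is computed from the raw scores rather than the thresholded predictions, which is why it can differ from accuracy on the same data.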
Experimental Settings
For training, we used binary cross-entropy as the loss function and the RMSProp optimizer with a learning rate of 10e-4. The network was trained for 30 epochs with early stopping and a batch size of 256. We evaluated the performance of our model in terms of accuracy vs. epochs and loss vs. epochs plots.
Results
We evaluated our model on two unpaired data sets for domain names. We took an input string from the dataset we created in the previous section, converted it into an image, and fed it to the model to generate a binary label. The results for the domain names are tabulated in Table 2. Out of the 400k test images, our model correctly categorized 372k images, resulting in 0.93 accuracy. Our model achieved an F1-score of 0.93, 13 points higher than the best previous model (0.80). Our model also outperforms the other baselines and comparable works on the other metrics, including accuracy, precision, recall, and AUC. The performance of the other models on our dataset was also below what they achieved on the datasets proposed in their own works, signifying our dataset's variations, difficulty, and importance.
Our dataset, code, and models are publicly available under the MIT License and can be accessed from our project's GitHub repository.¹
Discussion
We now discuss some interesting observations and experi-
ments which did not work and possible explanations regard-
ing them.
Using only Grayscale Images
During the image rendering phase, where we generated images from the data set containing real and homoglyph domains, we experimented with generating colored images instead of grayscale ones. We used (73,109,137) as the background color and (255,255,0) as the text color. However, the network trained on these colored images consistently underperformed the network trained on grayscale images. One possible reason is that grayscale rendering uses black and white, which are two extremes, and hence preserves the difference between adjoining letter and background pixels at the periphery of each letter. The colors, although they appear distinctly different to us, failed to preserve this difference when later passed through resizing operations.
We also performed data augmentation on our data and trained our network on the augmented data, but this led to a drop in accuracy. One possible reason is that data augmentation (Shorten and Khoshgoftaar 2019) is meant for scenarios where we expect distinctive image features to exist in the real world that are missing from the actual data set. This can be understood from a cats vs. dogs example. Usually, data sets contain cats and dogs in a limited number of positions, so a model fails to recognize some real images. The reason is that in the real world a cat or a dog might turn its head or sit in a different posture, which makes it difficult for a model to locate distinctive features like whiskers and pointy ears in cats and tongues in dogs without large amounts of data covering these variations. Data augmentation is used to mimic such variation by creating all these different types of images. In our case, however, it leads to flipped characters, and rotated images lead to anchor and tilde signs over letters pointing in different directions, which is not the case with real-world strings. Therefore, data augmentation was, in fact, counterproductive for our use case.
¹ https://github.com/Akshat4112/Glyphnet
We rendered images of size 256 × 256 during the image generation phase. Apart from 256 × 256, at which we observed the best results, we experimented with the following image sizes: 128 × 128, 150 × 150, 224 × 224, and 512 × 512. Smaller image sizes degraded performance relative to 256 × 256. Increasing the size did not lead to any significant improvement but increased the training time of the model. Hence, we use a 256 × 256 image size.
Building Model without Transfer Learning
In transfer learning, a base network is trained on a base dataset and task, and the learned features are then reused, or transferred, to a second target network to be trained on a target dataset. This process tends to work if the features are general, meaning suitable for both base and target tasks rather than specific to the base task. We performed experiments with transfer learning (Pan and Yang 2009) by incorporating networks like VGG16 (Simonyan and Zisserman 2014), ResNet18 (He et al. 2016), ResNet34, ResNet50, Wide ResNet-101-2, ResNeXt-50-32x4d, and ResNeXt-101-32x8d, which were trained on the ImageNet (Deng et al. 2009) dataset. Our experiments did not obtain good accuracy using these architectures, either pre-trained or trained from scratch.
There are two possible reasons:
1) Large number of hidden layers: These architectures have many hidden layers, ranging from 16 up to 100. The deeper the network, the more it aggregates the learned features to create high-level features. This works well on images of real-world entities, but in our context it does not help, as these are just images generated from strings. Going deeper makes the network lose the subtle features from parts of strings, such as tildes and apostrophes, that it needs to learn in order to differentiate real from homoglyph strings.
2) Pre-trained on a dataset from a different domain: Another reason is that these networks were pre-trained on the ImageNet dataset, which contains images of real-world entities but no images similar to those in our problem. Hence, using a pre-trained network with weights learned from such images instead of our problem domain did not help.
We obtained an accuracy of 63% to 67% using the above architectures.
Conclusion
In this work, we created a first-of-its-kind large-scale homoglyph phishing image dataset comprising 4M images of real and homoglyph domains. We then presented a baseline that learns features with an attention-based convolutional neural network trained on our constructed data set to differentiate real domain names from homoglyph domain names and thereby avoid homoglyph attacks. Our dataset and approach are robust because the model generalizes to unseen homoglyphs, whereas other approaches depend on paired data for every single inference; this leads it to outperform existing approaches. We believe this work is significant and provides an important benchmark to propel work in this area, and its applications would serve as a safeguard against phishing attacks in the real world.
References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural ma-
chine translation by jointly learning to align and translate.
arXiv preprint arXiv:1409.0473.
Boor, V.; Overmars, M. H.; and Van Der Stappen, A. F. 1999.
The Gaussian sampling strategy for probabilistic roadmap
planners. In Proceedings 1999 IEEE International Confer-
ence on Robotics and Automation (Cat. No. 99CH36288C),
volume 2, 1018–1023. IEEE.
Cheng, L.; Liu, F.; and Yao, D. 2017. Enterprise data breach:
causes, challenges, prevention, and future directions. Wi-
ley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, 7(5): e1211.
Chollet, F.; et al. 2015. Keras.
Damerau, F. J. 1964. A technique for computer detection
and correction of spelling errors. Communications of the
ACM, 7(3): 171–176.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
Ginsberg, A.; and Yu, C. 2018. Rapid homoglyph prediction
and detection. In 2018 1st International Conference on Data
Intelligence and Security (ICDIS), 17–23. IEEE.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.;
Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y.
2014. Generative adversarial nets. Advances in neural in-
formation processing systems, 27.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep resid-
ual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, 770–778.
Helms, M. M.; Ettkin, L. P.; and Morris, D. J. 2000. The risk
of information compromise and approaches to prevention.
The Journal of Strategic Information Systems, 9(1): 5–15.
Hoffer, E.; and Ailon, N. 2015. Deep metric learning us-
ing triplet network. In International workshop on similarity-
based pattern recognition, 84–92. Springer.
Hong, J. 2012. The state of phishing attacks. Communica-
tions of the ACM, 55(1): 74–81.
Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-
to-image translation with conditional adversarial networks.
In Proceedings of the IEEE conference on computer vision
and pattern recognition, 1125–1134.
Koch, G.; Zemel, R.; Salakhutdinov, R.; et al. 2015. Siamese
neural networks for one-shot image recognition. In ICML
deep learning workshop, volume 2, 0. Lille.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning.
Nature, 521(7553): 436–444.
LeCun, Y.; Bengio, Y.; et al. 1995. Convolutional networks
for images, speech, and time series. The handbook of brain
theory and neural networks, 3361(10): 1995.
Lu, Y.; Mohammed, N.; and Wang, Y. 2019. Homoglyph
attack detection with unpaired data. In Proceedings of the
4th ACM/IEEE Symposium on Edge Computing, 377–382.
Ma, J.; Saul, L. K.; Savage, S.; and Voelker, G. M. 2009.
Identifying suspicious URLs: an application of large-scale
online learning. In Proceedings of the 26th annual interna-
tional conference on machine learning, 681–688.
Majumder, M. T. H.; Rahman, M. M.; Iqbal, A.; and Rah-
man, M. S. 2020. Convolutional Neural Network Based En-
semble Approach for Homoglyph Recognition. Mathemati-
cal and Computational Applications, 25(4): 71.
Page, C. 2021. Microsoft secures court order to take
down malicious ‘homoglyph’ domains. TechCrunch.
https://techcrunch.com/2021/07/19/microsoft-secures-court-order-to-take-down-malicious-homoglpyh-domains/.
(Accessed on 08/16/2022).
Pan, S. J.; and Yang, Q. 2009. A survey on transfer learn-
ing. IEEE Transactions on knowledge and data engineering,
22(10): 1345–1359.
Reurink, A. 2018. Financial fraud: a literature review. Jour-
nal of Economic Surveys, 32(5): 1292–1325.
Ristad, E. S.; and Yianilos, P. N. 1998. Learning string-
edit distance. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(5): 522–532.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Con-
volutional networks for biomedical image segmentation. In
International Conference on Medical image computing and
computer-assisted intervention, 234–241. Springer.
Sern, L. J.; David, Y. G. P.; and Hao, C. J. 2020. Phish-
GAN: Data Augmentation and Identification of Homoglyph
Attacks. In 2020 International Conference on Communica-
tions, Computing, Cybersecurity, and Informatics (CCCI),
1–6. IEEE.
Shorten, C.; and Khoshgoftaar, T. M. 2019. A survey on
image data augmentation for deep learning. Journal of Big
Data, 6(1): 1–48.
Simonyan, K.; and Zisserman, A. 2014. Very deep convo-
lutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556.
Turkynewych, B. 2020. tb0hdan/domains: World’s single
largest Internet domains dataset.
https://github.com/tb0hdan/domains. (Accessed on 08/16/2022).
Ulikowski, M. 2015. elceef/dnstwist: Domain name permutation
engine for detecting homograph phishing attacks,
typo squatting, and brand impersonation.
https://github.com/elceef/dnstwist. (Accessed on 08/16/2022).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
tention is all you need. Advances in neural information pro-
cessing systems, 30.
Woodbridge, J.; Anderson, H. S.; Ahuja, A.; and Grant, D.
2018. Detecting homoglyph attacks with a siamese neural
network. In 2018 IEEE Security and Privacy Workshops
(SPW), 22–28. IEEE.