Preprint

Improving Homograph Attack Classification


Abstract

A visual homograph attack deceives web users about which domain they are visiting by exploiting forged domains that look similar to genuine ones. T. Thao et al. (IFIP SEC'19) proposed a homograph classification method that applies conventional supervised learning algorithms to features extracted from a single-character-based Structural Similarity Index (SSIM). This paper aims to improve the classification accuracy by combining their SSIM features with 199 features extracted from an N-gram model and applying advanced ensemble learning algorithms. The experimental results show that our proposed method improves accuracy by as much as 1.81% and reduces the false-positive rate by 2.15%. Furthermore, existing work applied machine learning to selected features without explaining why doing so improves accuracy. Although accuracy can be improved, understanding the underlying reasons is also crucial. Therefore, in this paper, we conduct an empirical error analysis and report several findings that explain the behavior of our proposed approach.
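The abstract does not say which N-gram model produces the 199 features, so the following is only an illustrative sketch of how character N-gram counts could be turned into a fixed-length feature vector for a domain label. The function names, the boundary markers `^` and `$`, and the tiny vocabulary are assumptions made for this example, not the authors' implementation.

```python
from collections import Counter


def char_ngrams(label, n=2):
    """Count character n-grams in a domain label.

    The label is padded with '^' and '$' (an assumption of this sketch)
    so that n-grams at the start and end of the label are distinguishable.
    """
    padded = f"^{label}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_vector(label, vocabulary, n=2):
    """Map a label to a fixed-length count vector over a chosen vocabulary."""
    counts = char_ngrams(label, n)
    return [counts[gram] for gram in vocabulary]


# Hypothetical three-entry vocabulary; a real model would use a larger one.
vocabulary = ["ap", "pp", "zz"]
features = ngram_vector("apple", vocabulary)
```

A vector like this could be concatenated with per-character SSIM scores before being passed to a classifier; how the paper actually combines the two feature sets is not stated in the abstract.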

References
Conference Paper
Cyber attackers create domain names that are visually similar to those of legitimate/popular brands by abusing valid internationalized domain names (IDNs). In this work, we systematize such domain names, which we call deceptive IDNs, and understand the risks associated with them. In particular, we propose a new system called DomainScouter to detect various deceptive IDNs and calculate a deceptive IDN score, a new metric indicating the number of users that are likely to be misled by a deceptive IDN. We perform a comprehensive measurement study on the identified deceptive IDNs using over 4.4 million registered IDNs under 570 top level domains (TLDs). The measurement results demonstrate that there are many previously unexplored deceptive IDNs targeting non-English brands or combining other domain squatting methods. Furthermore, we conduct online surveys to examine and highlight vulnerabilities in user perceptions when encountering such IDNs. Finally, we discuss the practical countermeasures that stakeholders can take against deceptive IDNs.
Article
Computing veterans remember an old habit of crossing zeros (Ø) in program listings to avoid confusing them with the letter O, in order to make sure the operator would type the program correctly into the computer. This habit, once necessary, has long been rendered obsolete by the increased availability of editing tools. However, the underlying problem of character resemblance is still there. Today it seems we may have to acquire a similar habit, this time to address an issue much more threatening than mere typos: security.

Let us begin with a short recourse to history. On April 7, 2000 an anonymous site published a bogus story intimating that the company PairGain Technologies (NASDAQ:PAIR) was about to be acquired for approximately twice its market value. The site employed the look and feel of the Bloomberg news service, and thus appeared quite authentic to unsuspecting users. To disseminate the "news", a message containing a link to the story was simultaneously posted to the Yahoo message board dedicated to PairGain. The link referred to the phony site by its numerical IP address rather than by name, and thus obscured its true identity. Many readers were convinced by the Bloomberg look and feel, and accepted the story at face value despite its suspicious address. As a result, PairGain stock first jumped 31%, and then fell drastically, incurring severe losses to investors.

Attacks like this are relatively easy to detect. A stronger variant of this hoax might have used a domain named bl00mberg.com (with zeros replacing o's), but even the latter is easily distinguishable from the real thing. However, forthcoming Internet technologies have the potential to make such attacks much more elusive and devastating. A new initiative, promoted by a number of Internet standards bodies including IETF and IANA, allows one to register domain names in national alphabets. This way, for example, Russian news site "gazeta.ru" ("gazeta" means "newspaper" in Russian) might register a more appealing "газета.ру". Far from buzzword compliance, the initiative caters to the genuine needs of non-English-speaking Internet users, who currently find it difficult to access Web sites otherwise. Several alternative implementations are currently being considered, and we can expect the standardization process to be completed soon.

The benefits of this initiative are indisputable. Yet the very idea of such an infrastructure is compromised by the peculiarities of world alphabets. Revisiting our newspaper example, one can observe that Russian letters "а, е, р, у" are indistinguishable in writing from their English counterparts. Some of the letters (such as "a") are close etymologically, while others look similar by sheer coincidence. For instance, Russian letter "р" is actually pronounced like "r", but the glyphs of the two letters are identical. As it happens, Russian is not the only such language; other Cyrillic languages may cause similar collisions. With the proposed infrastructure in place, numerous English domain names may be homographed, that is, maliciously misspelled by substitution of non-Latin letters. For example, the Bloomberg attack could have been crafted much more skillfully, by registering a domain name bloomberg.com, where the letters "o" and/or "e" have been faked with Russian substitutes. Without adequate safety mechanisms, this scheme can easily mislead even the most cautious reader.

1 Incidentally, this domain has actually been registered.
2 According to Global Reach's report, the English-speaking population of the Internet was about 62% in 1998, and is forecasted to be as low as 37% by the end of 2002.
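The substitution described above can be illustrated with a small check that flags labels mixing letters from more than one alphabet. This is a minimal sketch, not a method taken from any of the cited works: it approximates a character's script by the first word of its Unicode character name (for example "LATIN" or "CYRILLIC"), which is a heuristic rather than the official Unicode Script property, and the function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label):
    """Return the set of approximate scripts used by the letters in a label.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name, e.g. 'CYRILLIC SMALL LETTER O' -> 'CYRILLIC'.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def is_mixed_script(label):
    """True if the label combines letters from more than one script."""
    return len(scripts_in_label(label)) > 1


genuine = "bloomberg"
# Same visible spelling, but both o's are U+043E CYRILLIC SMALL LETTER O.
spoofed = "bl\u043e\u043emberg"

print(is_mixed_script(genuine))  # False
print(is_mixed_script(spoofed))  # True
```

A check like this would miss a label written entirely in look-alike Cyrillic letters, so it is at most one signal among several.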
Chapter
A visual homograph attack deceives victims about which domain they are communicating with by exploiting the fact that many characters look alike. The attack is becoming a serious problem and has drawn broad attention now that many brand domains have been targeted, such as apple.com (Apple Inc.), adobe.com (Adobe Systems Incorporated), and lloydsbank.co.uk (Lloyds Bank). Detecting visual homographs has therefore become a hot topic in both industry and the research community. Several existing papers and tools find some homographs of a given domain based on different subsets of look-alike characters, or on an analysis of the registered Internationalized Domain Name (IDN) database. However, we still lack a scalable and systematic approach that detects homographs registered by attackers with high accuracy and a low false-positive rate. In this paper, we construct a classification model to detect homographs and potential homographs registered by attackers, using machine learning on feasible and novel features: the visual similarity of each character and selected information from Whois. The implementation results show that our approach achieves up to 95.90% accuracy with a false-positive rate of only 3.27%. Furthermore, we perform an empirical analysis of the collected homographs and report some interesting statistics along with concrete misbehaviors and purposes of the attackers.
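To make "visual similarity on each character" concrete, the sketch below computes a single-window (global) SSIM between two small glyph bitmaps using the standard SSIM formula with the usual constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2. It is an assumption-laden illustration: the 5x5 bitmaps are hand-drawn stand-ins for rendered characters, and the cited work's actual rendering, window size, and feature construction are not described in this abstract.

```python
def global_ssim(img_x, img_y, dynamic_range=1.0):
    """Single-window SSIM between two equally sized grayscale bitmaps."""
    xs = [p for row in img_x for p in row]
    ys = [p for row in img_y for p in row]
    if len(xs) != len(ys):
        raise ValueError("bitmaps must have the same number of pixels")
    n = len(xs)
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def bitmap(rows):
    """Turn strings of '0'/'1' into a list of pixel rows."""
    return [[int(ch) for ch in row] for row in rows]


# Hypothetical 5x5 glyphs: letter O, a slashed zero, and letter l.
GLYPH_O = bitmap(["01110", "10001", "10001", "10001", "01110"])
GLYPH_ZERO = bitmap(["01110", "10011", "10101", "11001", "01110"])
GLYPH_L = bitmap(["00100", "00100", "00100", "00100", "00100"])

sim_o_zero = global_ssim(GLYPH_O, GLYPH_ZERO)
sim_o_l = global_ssim(GLYPH_O, GLYPH_L)
```

With these toy bitmaps, the O and slashed-zero glyphs score higher than O and l, which is the kind of per-character signal a classifier could consume.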
Andrew Ng, "Machine Learning Yearning: Technical Strategy for AI Engineers, In the Era of Deep Learning", 2018. Available: https://www.deeplearning.ai/content/uploads/2018/09/Ng-MLY01-12.pdf
U. Marcin, "Dnstwist: Domain name permutation engine for detecting typo squatting, phishing and corporate espionage", 2018. Available: https://github.com/elceef/dnstwist
F. Timo, "IDN Homograph Attack", 2017. Available: https://github.com/timofurrer/idn-homograph-attack
M. Alisson and A. Vandre, "EvilURL: Generate Unicode evil domains for IDN Homograph Attack and detect them", 2018. Available: https://github.com/UndeadSec/EvilURL
Z. Xudong, "Phishing with Unicode Domains", 2017. Available: https://www.xudongz.com/blog/2017/idn-phishing/
Unicode Inc., "Unicode Security Mechanisms for UTS #39", 2020. Available: https://www.unicode.org/Public/security/latest/confusables.txt