Takeshi Yagi’s research while affiliated with NTT and other places


Publications (64)


Figures 8–10: distributions of IP address locations, hosting providers, and frequent IP addresses and hosting providers.
Understanding Characteristics of Phishing Reports from Experts and Non-Experts on Twitter
  • Article
  • Full-text available

July 2024 · 63 Reads · IEICE Transactions on Information and Systems

Hiroki NAKANO · [...]

The increase in phishing attacks through email and short message service (SMS) has shown no signs of deceleration. The first thing we need to do to combat the ever-increasing number of phishing attacks is to collect and characterize more phishing cases that reach end users. Without understanding these characteristics, anti-phishing countermeasures cannot evolve. In this study, we propose an approach using Twitter as a new observation point to immediately collect and characterize phishing cases via e-mail and SMS that evade countermeasures and reach users. Specifically, we propose CrowdCanary, a system capable of structurally and accurately extracting phishing information (e.g., URLs and domains) from tweets about phishing by users who have actually discovered or encountered it. In our three months of live operation, CrowdCanary identified 35,432 phishing URLs out of 38,935 phishing reports. We confirmed that 31,960 (90.2%) of these phishing URLs were later detected by the anti-virus engine, demonstrating that CrowdCanary is superior to existing systems in both accuracy and volume of threat extraction. We also analyzed users who shared phishing threats by utilizing the extracted phishing URLs and categorized them into two distinct groups - namely, experts and non-experts. As a result, we found that CrowdCanary could collect information that is specifically included in non-expert reports, such as information shared only by the company brand name in the tweet, information about phishing attacks that we find only in the image of the tweet, and information about the landing page before the redirect. Furthermore, we conducted a detailed analysis of the collected information on phishing sites and discovered that certain biases exist in the domain names and hosting servers of phishing sites, revealing new characteristics useful for unknown phishing site detection.
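As a rough illustration of the extraction step the abstract describes: phishing-report tweets commonly "defang" URLs (hxxp, [.]) so they are not clickable, so recovering a canonical URL is a refang-then-match pass. The sketch below is an illustrative stand-in under that assumption, not CrowdCanary's actual extractor.

```python
import re

# Phishing reporters often defang URLs, e.g. "hxxps://evil[.]example[.]com/login".
# To analyze such reports we first restore ("refang") the URL, then extract it
# with a simple pattern. Illustrative only; not CrowdCanary's real pipeline.

def refang(text: str) -> str:
    text = re.sub(r"hxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    return text.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(tweet: str) -> list:
    return URL_RE.findall(refang(tweet))

tweet = "Phishing alert! hxxps://secure-login[.]example[.]com/verify impersonates a bank."
print(extract_urls(tweet))  # ['https://secure-login.example.com/verify']
```

A real extractor must also handle brand names and screenshots (which, per the paper, are exactly what non-expert reports tend to contain), so regex matching is only the first layer.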



Canary in Twitter Mine: Collecting Phishing Reports from Experts and Non-experts

March 2023 · 160 Reads

The rise in phishing attacks via e-mail and short message service (SMS) has not slowed down at all. The first thing we need to do to combat the ever-increasing number of phishing attacks is to collect and characterize more phishing cases that reach end users. Without understanding these characteristics, anti-phishing countermeasures cannot evolve. In this study, we propose an approach using Twitter as a new observation point to immediately collect and characterize phishing cases via e-mail and SMS that evade countermeasures and reach users. Specifically, we propose CrowdCanary, a system capable of structurally and accurately extracting phishing information (e.g., URLs and domains) from tweets about phishing by users who have actually discovered or encountered it. In our three months of live operation, CrowdCanary identified 35,432 phishing URLs out of 38,935 phishing reports. We confirmed that 31,960 (90.2%) of these phishing URLs were later detected by the anti-virus engine, demonstrating that CrowdCanary is superior to existing systems in both accuracy and volume of threat extraction. We also analyzed users who shared phishing threats by utilizing the extracted phishing URLs and categorized them into two distinct groups - namely, experts and non-experts. As a result, we found that CrowdCanary could collect information that is specifically included in non-expert reports, such as information shared only by the company brand name in the tweet, information about phishing attacks that we find only in the image of the tweet, and information about the landing page before the redirect.


FIGURE 6. Overview of DomainPrio's steps to identify domains likely to cause repeated lookups: (i) extracting features, (ii) extracting labels, (iii) training, and (iv) predicting.
FIGURE 8. Timeline of feature and label extraction: F-Days are used for feature extraction, L-Days for label extraction, P-Days for prediction.
DomainPrio: Prioritizing Domain Name Investigations to Improve SOC Efficiency

March 2022 · 50 Reads · 4 Citations · IEEE Access

Security Operations Centers (SOCs) are in need of automation for triaging alerts. Current approaches focus on analyzing and enriching individual alerts. We take a different approach and analyze the population of alerts. In an observational study over 24 weeks, we find a surprising pattern: some domains get analyzed again and again by different analysts, without coming to a final evaluation. Overall, 19% of the domains trigger 74% of all investigations. The most time-consuming domains are classified as false positives and “potentially unwanted programs”—the lowest threat level. To increase SOC efficiency, we design DomainPrio, a tool that prioritizes domains that are likely to be the subject of repeated, incomplete investigations. This enables us to indicate to the first analyst encountering this domain that the investigation should be, if possible, completed on this first attempt, so future investigations on the same domain can be prevented. DomainPrio is able to predict these domains with 89% accuracy and does so with an interpretable and auditable logistic regression model. When evaluating our tool on 100 days of data from a production setting, we find that it can potentially reduce the number of alert investigations by up to 35%, presenting the SOC with very substantial efficiency gains.
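The paper's actual features, labels, and 89%-accuracy model come from its 24-week SOC dataset and are not reproduced here. The minimal sketch below only shows the shape of the approach, using hypothetical per-domain features (prior investigation count, fraction of investigations without a final verdict) and toy data.

```python
import math

# Minimal logistic-regression sketch of DomainPrio's idea: predict, from
# per-domain features, whether a domain will be investigated repeatedly.
# Feature meanings and the toy training data are hypothetical stand-ins.

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy rows: [prior_investigations, fraction_without_final_verdict]
X = [[0, 0.0], [1, 0.0], [5, 0.8], [7, 0.9], [2, 0.1], [6, 1.0]]
y = [0, 0, 1, 1, 0, 1]
w, b = train(X, y)
print(round(predict(w, b, [6, 0.9])))  # likely repeat-investigation domain -> 1
```

The design point mirrors the paper's: a linear model keeps the per-feature weights auditable, which matters when the output decides how analysts spend their time.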




Follow Your Silhouette: Identifying the Social Account of Website Visitors through User-Blocking Side Channel

February 2020 · 15 Reads · IEICE Transactions on Information and Systems

This paper presents a practical side-channel attack that identifies the social web service account of a visitor to an attacker's website. Our attack leverages the widely adopted user-blocking mechanism, abusing its inherent property that certain pages return different web content depending on whether a user is blocked from another user. Our key insight is that an account prepared by an attacker can hold an attacker-controllable binary state of blocking/non-blocking with respect to an arbitrary user on the same service; provided that the user is logged in to the service, this state can be retrieved as one-bit data through the conventional cross-site timing attack when a user visits the attacker's website. We generalize and refer to such a property as visibility control, which we consider the fundamental assumption of our attack. Building on this primitive, we show that an attacker with a set of controlled accounts can gain a complete and flexible control over the data leaked through the side channel. Using this mechanism, we show that it is possible to design and implement a robust, large-scale user identification attack on a wide variety of social web services. To verify the feasibility of our attack, we perform an extensive empirical study using 16 popular social web services and demonstrate that at least 12 of these are vulnerable to our attack. Vulnerable services include not only popular social networking sites such as Twitter and Facebook, but also other types of web services that provide social features, e.g., eBay and Xbox Live. We also demonstrate that the attack can achieve nearly 100% accuracy and can finish within a sufficiently short time in a practical setting. We discuss the fundamental principles, practical aspects, and limitations of the attack as well as possible defenses. We have successfully addressed this attack by working collaboratively with service providers and browser vendors.
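The identification logic the abstract describes (k controlled accounts, each holding one blocked/not-blocked bit toward a user) amounts to assigning every candidate user a unique k-bit blocking pattern and decoding the k observed bits. A schematic and deliberately non-operational sketch of just that encoding, with no web service or timing measurement involved:

```python
# Schematic sketch of the paper's core observation: with k controlled
# accounts, an attacker can give every candidate user a distinct k-bit
# blocked/not-blocked pattern and identify a visitor from the k one-bit
# side-channel readings. Purely combinatorial; nothing is attacked here.

def assign_patterns(users):
    """Assign each candidate user a distinct k-bit blocking pattern."""
    k = max(1, (len(users) - 1).bit_length())  # ceil(log2(n)) accounts suffice
    return {u: tuple((i >> j) & 1 for j in range(k)) for i, u in enumerate(users)}

def identify(observed_bits, patterns):
    """Decode the k leaked bits back to the matching user, if any."""
    for user, bits in patterns.items():
        if bits == observed_bits:
            return user
    return None

users = [f"user{i}" for i in range(8)]
patterns = assign_patterns(users)             # 3 controlled accounts cover 8 users
print(identify(patterns["user5"], patterns))  # 'user5'
```

This logarithmic scaling is what makes the attack "large-scale" in the paper's sense: distinguishing among n candidate users needs only about log2(n) controlled accounts.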


Study on the Vulnerabilities of Free and Paid Mobile Apps Associated with Software Library

February 2020 · 110 Reads · 5 Citations · IEICE Transactions on Information and Systems

This paper reports a large-scale study that aims to understand how mobile application (app) vulnerabilities are associated with software libraries. We analyze both free and paid apps. Studying paid apps was quite meaningful because it helped us understand how differences in app development/maintenance affect the vulnerabilities associated with libraries. We analyzed 30k free and paid apps collected from the official Android marketplace. Our extensive analyses revealed that approximately 70%/50% of vulnerabilities of free/paid apps stem from software libraries, particularly from third-party libraries. Somewhat paradoxically, we found that more expensive/popular paid apps tend to have more vulnerabilities. This comes from the fact that more expensive/popular paid apps tend to have more functionality, i.e., more code and libraries, which increases the probability of vulnerabilities. Based on our findings, we provide suggestions to stakeholders of mobile app distribution ecosystems.
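The headline shares (roughly 70% of free-app and 50% of paid-app vulnerabilities originating in libraries) come from an aggregation of the following form; the record layout and sample rows below are hypothetical stand-ins for the paper's actual 30k-app dataset.

```python
from collections import Counter

# Hypothetical records: (app_id, vulnerability_id, origin), where origin is
# "app_code", "third_party_lib", or "platform_lib". The study's 70%/50%
# figures are shares of library-origin vulnerabilities computed this way
# over its real dataset; these rows are toy data.

records = [
    ("app1", "CVE-A", "third_party_lib"),
    ("app1", "CVE-B", "app_code"),
    ("app2", "CVE-C", "third_party_lib"),
    ("app2", "CVE-D", "platform_lib"),
    ("app3", "CVE-E", "third_party_lib"),
]

def library_share(records):
    """Fraction of vulnerabilities whose origin is any software library."""
    origins = Counter(origin for _, _, origin in records)
    lib = origins["third_party_lib"] + origins["platform_lib"]
    return lib / sum(origins.values())

print(library_share(records))  # 0.8
```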


Using Seq2Seq Model to Detect Infection Focusing on Behavioral Features of Processes

September 2019 · 42 Reads · Journal of Information Processing

Sophisticated cyber-attacks intended to earn money or steal confidential information, such as targeted attacks, have become a serious problem. Such attacks often use specially crafted malware, which employs hiding techniques such as process injection. Thus, preventing intrusion using conventional countermeasures is difficult, so a countermeasure needs to be developed that prevents attackers from reaching their ultimate goal. Therefore, we propose a method for estimating process maliciousness by focusing on process behavior. In our proposal, we first use one Seq2Seq model to extract a feature vector sequence from a process behavior log. Then, we use another Seq2Seq model to estimate the process maliciousness score by classifying the obtained feature vectors. By applying Seq2Seq models stepwise, our proposal can compress behavioral logs and extract abstracted behavioral features. We present an experimental evaluation using logs collected when actual malware is executed. The obtained results show that malicious processes are classified with a highest Area Under the Curve (AUC) of 0.979, and with 80% TPR even when the FPR is 1%. Furthermore, the results of an experiment using logs collected when simulated attacks are executed show that our proposal can detect unknown malicious processes that do not appear in the training data.
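The two-stage structure in the abstract (compress the behavior log into feature vectors, then classify the vector sequence) can be mimicked with deliberately simple stand-ins. In the sketch below, a bag-of-events histogram stands in for the first trained Seq2Seq encoder, and a fixed linear scorer stands in for the second Seq2Seq classifier; the event vocabulary, window size, and weights are all hypothetical.

```python
import math

# Stand-in sketch of the paper's stepwise pipeline. Stage 1 compresses each
# window of a process-behavior log into a fixed-length vector (the paper
# uses a trained Seq2Seq encoder; a per-window event histogram stands in).
# Stage 2 maps the vector sequence to a maliciousness score (the paper uses
# a second Seq2Seq model; a fixed linear scorer stands in).

EVENTS = ["file_read", "file_write", "proc_inject", "net_connect"]
WEIGHTS = [0.0, 0.1, 2.0, 0.8]  # hypothetical: process injection most suspicious

def encode_window(window):
    """Stage 1 stand-in: fixed-length feature vector for one log window."""
    n = max(1, len(window))
    return [window.count(e) / n for e in EVENTS]

def maliciousness(log, window_size=4):
    """Stage 2 stand-in: mean window score squashed into (0, 1)."""
    windows = [log[i:i + window_size] for i in range(0, len(log), window_size)]
    vecs = [encode_window(w) for w in windows]
    score = sum(sum(wt * x for wt, x in zip(WEIGHTS, v)) for v in vecs) / len(vecs)
    return 1.0 / (1.0 + math.exp(-score))

benign = ["file_read", "file_read", "file_write", "net_connect"]
shady  = ["proc_inject", "net_connect", "proc_inject", "file_write"]
print(maliciousness(shady) > maliciousness(benign))  # True
```

The point of the stepwise design, as the abstract notes, is compression: long raw logs become short vector sequences before classification, which is what the first stage buys.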


Efficient Dynamic Malware Analysis for Collecting HTTP Requests using Deep Learning

April 2019 · 99 Reads · 4 Citations · IEICE Transactions on Information and Systems

Malware-infected hosts have typically been detected using network-based Intrusion Detection Systems on the basis of characteristic patterns of HTTP requests collected with dynamic malware analysis. Since attackers continuously modify malicious HTTP requests to evade detection, novel HTTP requests sent from new malware samples need to be exhaustively collected in order to maintain a high detection rate. However, analyzing all new malware samples for a long period is infeasible in a limited amount of time. Therefore, we propose a system for efficiently collecting HTTP requests with dynamic malware analysis. Specifically, our system analyzes a malware sample for a short period and then determines whether the analysis should be continued or suspended. Our system identifies malware samples whose analyses should be continued on the basis of the network behavior in their short-period analyses. To make an accurate determination, we focus on the fact that malware communications resemble natural language from the viewpoint of data structure. We apply the recursive neural network, which has recently exhibited high classification performance in the field of natural language processing, to our proposed system. In the evaluation with 42,856 malware samples, our proposed system collected 94% of novel HTTP requests and reduced analysis time by 82% in comparison with the system that continues all analyses.
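The scheduling decision the abstract describes (analyze briefly, then continue only if novel HTTP requests are still likely) is made in the paper by a recursive neural network over the request structure. The sketch below shows only the decision interface, with a much simpler token-novelty heuristic standing in for the trained model; the tokenization and threshold are arbitrary.

```python
import re

# Sketch of the analysis-scheduling decision: after a short-period run,
# keep analyzing a sample only if its early HTTP behavior still looks
# novel relative to requests already collected from other samples.
# The paper's recursive neural network is replaced by a crude heuristic.

def tokenize(request_line):
    """Split an HTTP request line into crude path/query tokens."""
    return set(re.split(r"[/?&=.\s:]+", request_line)) - {""}

def should_continue(early_requests, known_corpus, threshold=0.3):
    """Continue full analysis iff enough early tokens are previously unseen."""
    known = set().union(*(tokenize(r) for r in known_corpus)) if known_corpus else set()
    seen = set().union(*(tokenize(r) for r in early_requests)) if early_requests else set()
    if not seen:
        return False  # no network activity in the short run: suspend
    novelty = len(seen - known) / len(seen)
    return novelty >= threshold

known = ["GET /gate.php?id=1 HTTP/1.1", "POST /panel/login HTTP/1.1"]
print(should_continue(["GET /newbot/cmd?key=abc HTTP/1.1"], known))  # True
print(should_continue(["GET /gate.php?id=1 HTTP/1.1"], known))       # False
```

Any such early-stopping rule trades coverage for analysis time; the paper reports keeping 94% of novel requests while cutting analysis time by 82%, which a threshold like this is tuned to approximate.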


Citations (50)


... In addition, it removes unimportant HTML elements and reduces OCR-extracted text by eliminating sentences with the smallest font size. The experimental results on a dataset of 1000 phishing and 1000 non-phishing websites (sourced from OpenPhish, PhishTank, and posts from Twitter using CrowdCanary [128]) showed that GPT-4V achieved the most outstanding accuracy of 99.2% and precision of 98.7% and a recall of 99.6%, while GPT-4 (normal mode) also performed well with 98.3% precision and 98.4% recall and accuracy. Notably, the authors emphasize the utility of LLMs, which eliminates the need for additional model training, as the LLMs handle phishing detection directly. ...

Reference:

A State-of-the-Art Review on Phishing Website Detection Techniques
Canary in Twitter Mine: Collecting Phishing Reports from Experts and Non-experts
  • Citing Conference Paper
  • August 2023

... However, cloud computing power, data security, data silos and other risks will inevitably become constraints for AI to win user trust, collect private data, and achieve large-scale implementation. [24] While current centralized LLMs have advantages in some aspects, such as faster convergence speed and lower communication cost compared with decentralized architectures, centralized architectures fall short in preventing privacy leakage, as privacy concerns have become a growing issue for users. Therefore, there is an urgent need for a practical and effective technique to alleviate the above problems and make AI full of vitality again. ...

Classification of diversified web crawler accesses inspired by biological adaptation
  • Citing Article
  • January 2021

International Journal of Bio-Inspired Computation

... Issues such as cloud computing capacity limitations, data security concerns, and data silos are increasingly recognized as barriers to fostering trust, gathering private user data, and facilitating federated training [20]. Consequently, to mitigate these challenges, there arises a critical need for effective solutions in privacy-sensitive scenarios. ...

Classification of diversified web crawler accesses inspired by biological adaptation
  • Citing Article
  • January 2021

International Journal of Bio-Inspired Computation

... Outdated software: Failure to update mobile software can lead to vulnerabilities that can be exploited by attackers [19]. Outdated software may lack security patches or fixes that address known vulnerabilities. ...

Study on the Vulnerabilities of Free and Paid Mobile Apps Associated with Software Library
  • Citing Article
  • February 2020

IEICE Transactions on Information and Systems

... Once again, WannaCry takes first place with the highest number of these requests. [11] Moreover, KPOT is the only malware that sent data using the HTTPS POST method. Kelihos had the highest number of IP address connections (11); however, these were no longer responding, which is even more suspicious. ...

Efficient Dynamic Malware Analysis for Collecting HTTP Requests using Deep Learning

IEICE Transactions on Information and Systems

... Therefore, adversarial attacks and defenses are used for malware detection [53], in addition to edge computing and DL malware detection for IoT [54], and forensic neural networks for malware localization [55]. On the other hand, drive-by downloads are Infecting a user's computer or device when they visit a compromised or malicious website, often without their knowledge or consent [56]. Twitter has faced this event before [57], [58], therefore detection measures were a good research path to optimize it [59]. ...

Evasive Malicious Website Detection by Leveraging Redirection Subgraph Similarities
  • Citing Article
  • March 2019

IEICE Transactions on Information and Systems

... Over the span of the past 10 years, machine learning has been the topic in focus for practitioners on account of its favourable performance and convenient end-to-end modality. The most famous among these neural networks is the convolutional neural network (CNN); however, the majority of CNN models are applied in image processing [20]. In order to achieve application to sequence processing, M. Wang et al. [15] proposed a novel CNN model (genCNN) with the ability to predict the next word from a variable-length history of words. ...

Event De-Noising Convolutional Neural Network for Detecting Malicious URL Sequences from Proxy Logs

IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences

... The small dataset used in Nagai et al. (2019) misses many high profile EKs (Angler, Nuclear, Neutrino, Rig, Fiesta), and, modelling redirects based on time alone is problematic. Redirection chains are mapped in Takata et al. (2018), but, content-based redirects are not considered. Shibahara et al. (2019) models redirections irrespective of occurrence, e.g. if URL is found in JS but wasn't accessed, it's still labelled as a redirect. ...

Identifying Evasive Code in Malicious Websites by Analyzing Redirection Differences
  • Citing Article
  • November 2018

IEICE Transactions on Information and Systems

... The second approach can occur if an attacker locks a user out of an account or overloads the service with too many requests to bring down the entire application. Attackers can combine these two approaches with web-service-specific attacks to maximize damage (Watanabe, Shioji, Akiyama, Sasaoka, Yagi & Mori, 2018). ...

User Blocking Considered Harmful? An Attacker-Controllable Side Channel to Identify Social Accounts
  • Citing Conference Paper
  • April 2018