Conference Paper

Enabling Network Security Through Active DNS Datasets


Abstract

Most modern cyber crime leverages the Domain Name System (DNS) to attain high levels of network agility and make detection of Internet abuse challenging. The majority of malware, which represent a key component of illicit Internet operations, are programmed to locate the IP address of their command-and-control (C&C) server through DNS lookups. To make the malicious infrastructure both agile and resilient, malware authors often use sophisticated communication methods that utilize DNS (i.e., domain generation algorithms) for their campaigns. In general, Internet miscreants make extensive use of short-lived disposable domains to promote a large variety of threats and support their criminal network operations. To effectively combat Internet abuse, the security community needs access to freely available and open datasets. Such datasets will enable the development of new algorithms for the early detection and tracking of modern Internet threats across their entire lifetime. To that end, we have created a system, Thales, that actively queries and collects records for massive amounts of domain names from various seeds. These seeds are collected from multiple public sources and are, therefore, free of privacy concerns. The results of this effort will be open and made freely available to the research community. With three case studies we demonstrate the detection merit that the collected active DNS datasets contain. We show that (i) more than 75% of the domain names in public black lists (PBLs) appear in our datasets several weeks (and in some cases months) in advance, (ii) existing DNS research can be implemented using only active DNS, and (iii) malicious campaigns can be identified with the signal provided by active DNS.
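The collection loop the abstract describes can be sketched in a few lines: merge seed domains from several public sources, deduplicate them, and resolve each one while recording the answers. The function names, seed format, and stub resolver below are illustrative assumptions, not the paper's actual Thales implementation; a real deployment would plug in a DNS client and persist the responses.

```python
# Hypothetical sketch of an active-DNS collection loop in the spirit of
# Thales: aggregate seeds, deduplicate, resolve, record.
from typing import Callable, Dict, List

def build_seed_list(sources: List[List[str]]) -> List[str]:
    """Merge seed domains from several public sources, normalizing
    case and trailing dots, and removing duplicates."""
    seen = set()
    seeds = []
    for source in sources:
        for name in source:
            canonical = name.strip().rstrip(".").lower()
            if canonical and canonical not in seen:
                seen.add(canonical)
                seeds.append(canonical)
    return seeds

def collect_records(seeds: List[str],
                    resolve: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Query every seed with the supplied resolver and keep the answers.
    In practice `resolve` would wrap a real DNS client."""
    dataset = {}
    for domain in seeds:
        try:
            dataset[domain] = resolve(domain)
        except OSError:
            dataset[domain] = []  # timeouts/NXDOMAIN recorded as empty
    return dataset

# Example with a stub resolver standing in for real DNS queries:
seeds = build_seed_list([["Example.COM", "abuse.example."],
                         ["example.com", "seed.test"]])
stub = {"example.com": ["93.184.216.34"]}
dataset = collect_records(seeds, lambda d: stub.get(d, []))
```

Because the seeds come from public sources, no individual user's queries enter the dataset, which is the privacy property the abstract emphasizes.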


... Many previous studies use passive measurement to observe DNS traffic on their networks [14,31,61,97,108]. However, passive data collection can suffer from bias depending on the time, location, and demographics of users within the observed network. ...
... There are also prior works (by both academia and industry) that conducted large-scale active DNS measurements for several purposes and made their datasets available to the community [61,90,104]. However, all of these datasets have two common issues that make them unsuitable to be used directly in our study. ...
... In addition, we also analyze reverse DNS data to examine the current adoption status of reverse mapping from IP to domain name, and compare it to our number of single-hosted domains found in §5.1, as this is another potential way for an adversary to infer a visited domain. The Active DNS Project [61] is currently collecting A records of about 300M domains derived from 1.3K TLD zone files on a daily basis. In addition to this effort, Rapid7 also conducts active DNS measurements at a large scale and offers researchers access to its data [90]. ...
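The reverse DNS data mentioned in the snippet above maps an IP address back to a name via a PTR record under in-addr.arpa. As a small illustration (an assumption for clarity, not taken from the cited work), this is how the PTR query name is derived from an IPv4 address; a real lookup would then ask a resolver for the PTR record at that name:

```python
# Build the in-addr.arpa query name for an IPv4 reverse lookup:
# octets are reversed and suffixed with "in-addr.arpa".
def reverse_query_name(ipv4: str) -> str:
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255
                                   for o in octets):
        raise ValueError(f"not an IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + ".in-addr.arpa"

name = reverse_query_name("93.184.216.34")  # "34.216.184.93.in-addr.arpa"
```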
Preprint
Full-text available
As Internet users have become more savvy about the potential for their Internet communication to be observed, the use of network traffic encryption technologies (e.g., HTTPS/TLS) is on the rise. However, even when encryption is enabled, users leak information about the domains they visit via their DNS queries and via the Server Name Indication (SNI) extension of TLS. Two proposals to ameliorate this issue are DNS over HTTPS/TLS (DoH/DoT) and Encrypted SNI (ESNI). In this paper we aim to assess the privacy benefits of these proposals by considering the relationship between hostnames and IP addresses, the latter of which are still exposed. We perform DNS queries from nine vantage points around the globe to characterize this relationship. We quantify the privacy gain due to ESNI for different hosting and CDN providers using two different metrics, the k-anonymity degree due to co-hosting and the dynamics of IP address changes. We find that 20% of the domains studied will not gain any privacy benefit since they have a one-to-one mapping between their hostname and IP address. Our results show that 30% will gain a high level of privacy benefit with a k value greater than 100, meaning that an adversary can correctly guess these domains with a probability less than 1%. Domains whose visitors will gain a high privacy level are far less popular, while visitors of popular sites will gain much less. Analyzing the dynamics of IP addresses of long-lived domains, we find that only 7.7% of them change their hosting IP addresses on a daily basis. We conclude by discussing potential approaches for website owners and hosting/CDN providers for maximizing the privacy benefits of ESNI.
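The k-anonymity metric in the abstract above admits a compact sketch: a domain's k is the number of domains co-hosted on its IP, so an adversary who sees only the IP guesses the visited domain with probability 1/k. The input format and the choice of taking the minimum over a domain's IPs are illustrative assumptions:

```python
# Co-hosting k-anonymity: k = number of domains sharing the hosting IP.
from collections import defaultdict
from typing import Dict, List

def k_anonymity(domain_ips: Dict[str, List[str]]) -> Dict[str, int]:
    ip_domains = defaultdict(set)
    for domain, ips in domain_ips.items():
        for ip in ips:
            ip_domains[ip].add(domain)
    # Use the least-populated of a domain's IPs: the adversary's
    # best vantage point gives the smallest anonymity set.
    return {domain: min(len(ip_domains[ip]) for ip in ips)
            for domain, ips in domain_ips.items()}

k = k_anonymity({
    "alone.example":    ["203.0.113.1"],   # one-to-one mapping: k = 1
    "shared-a.example": ["203.0.113.2"],
    "shared-b.example": ["203.0.113.2"],   # co-hosted pair: k = 2
})
```

A one-to-one hostname-to-IP mapping yields k = 1, the "no privacy benefit" case the abstract reports for 20% of domains.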
... For years, TLD zone files have been a popular tool in the security community to find active domain names [12,30,34,36,37,43]. Zone files are publicly available to researchers who request access and they represent a snapshot of resolvable domains, but not all registered domains will appear in zone files. ...
... Because of this behavior, we conclude that zone files alone cannot be used to indicate whether a domain is registered or not. Therefore, systems that rely on zone files [12,30,34,36,37,43] are bound to miss domains that are temporarily suspended as well as ones that are registered but absent from the zone files. ...
... Another lesson from this study is that registration data including status, registrar, and dates should be maintained as a publicly available resource. Public access to zone files has been very successful in aiding security research and applications [12,30,34,36,37,43], but it is not enough to identify all registered domain names, nor does it cover all stages of the domain life-cycle making cases like early deletion and dropcatching more difficult to monitor. In the past, query limits on WHOIS were a reasonable precaution to prevent mass collection of registrants' personal information, but with recent changes to WHOIS privacy, largely driven by the EU's GDPR, this is no longer necessary. ...
Conference Paper
Full-text available
Domain names are a valuable resource on the web. Most domains are available to the public on a first-come, first-served basis and once domains are purchased, the owners keep them for a period of at least one year before they may choose to renew them. Common wisdom suggests that even if a domain name stops being useful to its owner, the owner will merely wait until the domain organically expires and choose not to renew. In this paper, contrary to common wisdom, we report on the discovery that domain names are often deleted before their expiration date. This is concerning because this practice offers no advantage for legitimate users, while malicious actors deleting domains may hamper forensic analysis of malicious campaigns, and registrars deleting domains instead of suspending them enable re-registration and continued abuse. Specifically, we present the first systematic analysis of early domain name disappearances from the largest top-level domains (TLDs). We find more than 386,000 cases where domain names were deleted before expiring and we discover individuals with more than 1,000 domains deleted in a single day. Moreover, we identify the specific registrars that choose to delete domain names instead of suspending them. We compare lexical features of these domains, finding significant differences between domains that are deleted early, suspended, and organically expiring. Furthermore, we explore potential reasons for deletion, finding over 7,000 domain names squatting more popular domains and more than 14,000 associated with malicious registrants.
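One way early deletions like those studied above could be flagged is by diffing daily zone snapshots: a domain that vanishes from the zone while its registration period has not yet elapsed is a candidate early deletion. This is a hedged sketch of that idea, not the paper's actual pipeline; the field names and dates are illustrative.

```python
# Flag candidate early deletions from two daily zone-file snapshots.
from datetime import date
from typing import Dict, Set

def early_deletions(zone_yesterday: Set[str], zone_today: Set[str],
                    expires: Dict[str, date], today: date) -> Set[str]:
    disappeared = zone_yesterday - zone_today
    # Keep only domains whose registration should still be active.
    return {d for d in disappeared if expires.get(d, today) > today}

gone = early_deletions(
    {"kept.example", "early.example", "expired.example"},
    {"kept.example"},
    {"early.example": date(2030, 1, 1), "expired.example": date(2020, 1, 1)},
    today=date(2024, 6, 1),
)
```

Note the caveat raised in the snippets above: absence from a zone file does not by itself prove deletion, since suspended or server-hold domains also drop out of the zone, so registration data would be needed to confirm each candidate.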
... DNS data is one of the most notable sources of information utilized to detect malicious domains [55], [33]. In general, there are two types of approaches that complement each other. ...
... For example, Farsight passive DNS data [23] utilizes sensors deployed behind DNS resolvers and provides aggregate information about domain resolutions. Active DNS data is collected by periodically querying a large pre-compiled list of domains in the Internet (e.g., [33]). It is true that passive DNS data has been an invaluable source of information for detecting and mitigating malicious activities in the Internet [18], [27], [29]. ...
... It is true that passive DNS data has been an invaluable source of information for detecting and mitigating malicious activities in the Internet [18], [27], [29]. However, in this paper, we focus on active DNS data due to the difficulty of obtaining other types of DNS data including passive DNS data and logs of DNS servers because of sensitivity of information or financial costs [23], [33]. ...
Article
Inference based techniques are one of the major approaches to analyze DNS data and detect malicious domains. The key idea of inference techniques is to first define associations between domains based on features extracted from DNS data. Then, an inference algorithm is deployed to infer potential malicious domains based on their direct/indirect associations with known malicious ones. The way associations are defined is key to the effectiveness of an inference technique. It is desirable for an association scheme to be both accurate (i.e., avoid falsely associating domains with no meaningful connections) and to have good coverage (i.e., identify all associations between domains with meaningful connections). Due to the limited scope of information provided by DNS data, it becomes a challenge to design an association scheme that achieves both high accuracy and good coverage. In this paper, we propose a new association scheme to identify domains controlled by the same entity. Our key idea is an in-depth analysis of active DNS data to accurately separate public IPs from dedicated ones, which enables us to build high-quality associations between domains. Our scheme identifies many meaningful connections between domains that are discarded by existing state-of-the-art approaches. Our experimental results show that the proposed association scheme not only significantly improves the domain coverage compared to existing approaches but also achieves better detection accuracy. The existing path-based inference algorithm is specifically designed for DNS data analysis; it is effective but computationally expensive. As a solution, we investigate the effectiveness of combining our association scheme with the generic belief propagation algorithm. Through comprehensive experiments, we show that this approach offers significant efficiency and scalability improvements with only a minor negative impact on detection accuracy.
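The core association idea in the abstract above can be sketched as follows: treat an IP as "public" (shared hosting) when it serves many domains, and link two domains only when they share a dedicated IP. The fixed tenant-count threshold is an illustrative assumption, not the paper's actual criterion for separating public from dedicated IPs:

```python
# Toy association scheme: link domains that share a dedicated
# (low-tenant) IP; ignore co-residence on shared/public IPs.
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

def dedicated_associations(domain_ips: Dict[str, List[str]],
                           public_threshold: int = 3) -> Set[Tuple[str, str]]:
    ip_domains = defaultdict(set)
    for domain, ips in domain_ips.items():
        for ip in ips:
            ip_domains[ip].add(domain)
    pairs = set()
    for ip, domains in ip_domains.items():
        if len(domains) >= public_threshold:
            continue  # shared/public IP: co-residence is not meaningful
        for a, b in combinations(sorted(domains), 2):
            pairs.add((a, b))
    return pairs

links = dedicated_associations({
    "c2-a.example":  ["198.51.100.7"],
    "c2-b.example":  ["198.51.100.7"],   # shares a dedicated IP
    "site1.example": ["192.0.2.1"],
    "site2.example": ["192.0.2.1"],
    "site3.example": ["192.0.2.1"],      # three tenants: treated as public
})
```

Filtering out public IPs is what avoids the weak "co-IP" pitfall: on shared hosting, two unrelated domains routinely resolve to the same address, so linking them would flood the inference graph with false associations.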
... DNS data is one of the most notable sources of information utilized to detect malicious domains [27,43]. In general, there are two types of approaches that complement each other. ...
... DNS data can be obtained by deploying sensors in the DNS query process. Based on the location and the method of collection, DNS data can be grouped into three main categories: passive DNS data [20,43], active DNS data [27], and logs of DNS servers. Our malicious domain detection approach uses active DNS data due to its ready availability and fewer privacy concerns compared to passive DNS data. ...
... Active DNS Dataset. Our experiments are performed on the active DNS datasets made available by the system called Thales, part of the Georgia Institute of Technology's Active DNS project [1,27]. Thales scans DNS using a seed domain list compiled from multiple sources, including public blacklists (e.g. ...
Conference Paper
Full-text available
Inference based techniques are one of the major approaches to analyze DNS data and detect malicious domains. The key idea of inference techniques is to first define associations between domains based on features extracted from DNS data. Then, an inference algorithm is deployed to infer potential malicious domains based on their direct/indirect associations with known malicious ones. The way associations are defined is key to the effectiveness of an inference technique. It is desirable to be both accurate (i.e., avoid falsely associating domains with no meaningful connections) and with good coverage (i.e., identify all associations between domains with meaningful connections). Due to the limited scope of information provided by DNS data, it becomes a challenge to design an association scheme that achieves both high accuracy and good coverage. In this paper, we propose a new approach to identify domains controlled by the same entity. Our key idea is an in-depth analysis of active DNS data to accurately separate public IPs from dedicated ones, which enables us to build high-quality associations between domains. Our scheme avoids the pitfall of naive approaches that rely on weak "co-IP" relationship of domains (i.e., two domains are resolved to the same IP) that results in low detection accuracy, and, meanwhile, identifies many meaningful connections between domains that are discarded by existing state-of-the-art approaches. Our experimental results show that the proposed approach not only significantly improves the domain coverage compared to existing approaches but also achieves better detection accuracy. Existing path-based inference algorithms are specifically designed for DNS data analysis. They are effective but computationally expensive. To further demonstrate the strength of our domain association scheme as well as improve the inference efficiency, we construct a new domain-IP graph that can work well with the generic belief propagation algorithm. 
Through comprehensive experiments, we show that this approach offers significant efficiency and scalability improvements with only a minor impact on detection accuracy, which suggests that such a combination could offer a good tradeoff for malicious domain detection in practice.
... In this section we categorize different types of DNS data, auxiliary information and ground truth that are used in the schemes proposed in the literature. The way these data are collected has a significant impact on the underlying assumptions and intuitions of malicious domain detection [47], BotGAD [45,46], Lee and Lee [100], Krishnan et al. [93], Manadhata et al. [109], Yadav et al. [165,166], Smash [171], Segugio [133,134], Perdisci et al. [128], Oprea et al. [126], Stalmans and Irwin [148] 1b: Exposure [35,36], Notos [28], Khalil et al. [82], Kopis [29], Huang and Greve [77], Yu et al. [169], Gao et al. [63], Mishsky et al. [112] 2. How the Data is Collected a) Active b) Passive 2a (Sources: Thales [91]): Holz et al. [73], Fluxor [127], Nazario and Holz [121], BDS [130], Konte et al. [89], Ma et al. [105], Felegyhazi et al. [57], Hao et al. [68], Predator [69], DomainProfiler [44] 2b (Sources: Farsight database [55]): Choi et al. [47], BotGAD [45,46], Manadhata et al. [109], Exposure [35,36], Notos [28], Khalil et al. [82], Kopis [29], Huang and Greve [77], Yu et al. [169], Gao et al. [63], Yadav et al. [165,166], Smash [171], Segugio [133,134] ...
... Active DNS data collection. To actively obtain DNS data, a data collector deliberately sends DNS queries and records the corresponding DNS responses [44,73,89,91,105,121,127]. The list of queried domains is built from multiple sources; typical ones include popular domain lists such as the Alexa Top Sites [25], domains appearing in various blacklists, or those from the zone files of authoritative servers. ...
... The primary reason lies in the difficulty of making publicly available a set of common or comparable reference datasets. Although there are currently several publicly available DNS datasets, which have been collected passively (e.g., from Farsight [55]) or actively (e.g., Thales [91]), they cannot be used in many approaches, especially in those relying on client-side patterns [109,126]. It should also be noted that although some approaches may work on data collected both actively and passively (for instance, the one proposed by Khalil et al. [82], which relies on domain co-location information obtainable from both datasets), such a comparison has never been performed before. ...
Article
Malicious domains are one of the major resources required for adversaries to run attacks over the Internet. Due to the important role of the Domain Name System (DNS), extensive research has been conducted to identify malicious domains based on their unique behavior reflected in different phases of the life cycle of DNS queries and responses. Existing approaches differ significantly in terms of intuitions, data analysis methods as well as evaluation methodologies. This warrants a thorough systematization of the approaches and a careful review of the advantages and limitations of every group. In this paper, we perform such an analysis. In order to achieve this goal, we present the necessary background knowledge on DNS and malicious activities leveraging DNS. We describe a general framework of malicious domain detection techniques using DNS data. Applying this framework, we categorize existing approaches using several orthogonal viewpoints, namely (1) sources of DNS data and their enrichment, (2) data analysis methods, and (3) evaluation strategies and metrics. In each aspect, we discuss the important challenges that the research community should address in order to fully realize the power of DNS data analysis to fight against attacks leveraging malicious domains.
... Using passive measurement, DNS data is obtained by an entity who is in a position to capture DNS traffic from the network infrastructure under its control (e.g., networks of academic institutes or small organizations) [43]. Several previous studies use passive measurement to observe DNS traffic [5,8,20,37,43]. Passive measurements, however, may introduce bias in the data collected depending on the time, location, and demographics of users within the monitored network. Moreover, another issue with passive data collection is ethics, as data gathered over a long period of time can reveal online habits of monitored users. ...
... Researchers can choose which domains to resolve depending on the goals of their study, thus having more control over the collected data. Although this approach can remedy the privacy issue of passive DNS measurement, it requires an increased amount of resources for running a dedicated measurement infrastructure if there is a large number of domains that need to be resolved [20]. There are prior works that have been conducting large-scale active DNS measurements for different purposes and provide their datasets to the community [20,33]. ...
Preprint
Full-text available
Understanding web co-location is essential for various reasons. For instance, it can help one to assess the collateral damage that denial-of-service attacks or IP-based blocking can cause to the availability of co-located web sites. However, it has been more than a decade since the first study was conducted in 2007. The Internet infrastructure has changed drastically since then, necessitating a renewed study to comprehend the nature of web co-location. In this paper, we conduct an empirical study to revisit web co-location using datasets collected from active DNS measurements. Our results show that the web is still small and centralized to a handful of hosting providers. More specifically, we find that more than 60% of web sites are co-located with at least ten other web sites---a group comprising less popular web sites. In contrast, 17.5% of mostly popular web sites are served from their own servers. Although a high degree of web co-location could make co-hosted sites vulnerable to DoS attacks, our findings show that it is an increasing trend to co-host many web sites and serve them from well-provisioned content delivery networks (CDN) of major providers that provide advanced DoS protection benefits. Regardless of the high degree of web co-location, our analyses of popular block lists indicate that IP-based blocking does not cause severe collateral damage as previously thought.
... Using passive measurement, DNS data is obtained by an entity who is in a position to capture DNS trac from the network infrastructure under its control (e.g., networks of academic institutes or small organizations) [43]. Several previous studies use passive measurement to observe DNS trac [5,8,20,37,43]. Passive measurements, however, may introduce bias in the data collected depending on the time, location, and demographics of users within the monitored network. Moreover, another issue with passive data collection is ethics, as data gathered over a long period of time can reveal online habits of monitored users. ...
... Researchers can choose which domains to resolve depending on the goals of their study, thus having more control over the collected data. Although this approach can remedy the privacy issue of passive DNS measurement, it requires an increased amount of resources for running a dedicated measurement infrastructure if there is a large number of domains that need to be resolved [20]. There are prior works that have been conducting large-scale active DNS measurements for dierent purposes and provide their datasets to the community [20,33]. ...
... Although this approach can remedy the privacy issue of passive DNS measurement, it requires an increased amount of resources for running a dedicated measurement infrastructure if there is a large number of domains that need to be resolved [20]. There are prior works that have been conducting large-scale active DNS measurements for dierent purposes and provide their datasets to the community [20,33]. ...
Article
Full-text available
Understanding web co-location is essential for various reasons. For instance, it can help one to assess the collateral damage that denial-of-service attacks or IP-based blocking can cause to the availability of co-located web sites. However, it has been more than a decade since the first study was conducted in 2007. The Internet infrastructure has changed drastically since then, necessitating a renewed study to comprehend the nature of web co-location. In this paper, we conduct an empirical study to revisit web co-location using datasets collected from active DNS measurements. Our results show that the web is still small and centralized to a handful of hosting providers. More specifically, we find that more than 60% of web sites are co-located with at least ten other web sites---a group comprising less popular web sites. In contrast, 17.5% of mostly popular web sites are served from their own servers. Although a high degree of web co-location could make co-hosted sites vulnerable to DoS attacks, our findings show that it is an increasing trend to co-host many web sites and serve them from well-provisioned content delivery networks (CDN) of major providers that provide advanced DoS protection benefits. Regardless of the high degree of web co-location, our analyses of popular block lists indicate that IP-based blocking does not cause severe collateral damage as previously thought.
... In this section we categorize diferent types of DNS data, auxiliary information and ground truth that are used in the schemes proposed in the literature. The way these data are collected has a signiicant impact on the underlying assumptions and intuitions of malicious domain detection [47], BotGAD [45,46], Lee and Lee [100], Krishnan et al. [93], Manadhata et al. [109], Yadav et al. [165,166], Smash [171], Segugio [133,134], Perdisci et al. [128], Oprea et al. [126], Stalmans and Irwin [148] 1b: Exposure [35,36], Notos [28], Khalil et al. [82], Kopis [29], Huang and Greve [77], Yu et al. [169], Gao et al. [63], Mishsky et al. [112] 2. How the Data is Collected a) Active b) Passive 2a (Sources: Thales [91]): Holz et al. [73], Fluxor [127], Nazario and Holz [121], BDS [130], Konte et al. [89], Ma et al. [105], Felegyhazi et al. [57], Hao et al. [68], Predator [69], DomainProiler [44] 2b (Sources: Farsight database [55]): Choi et al. [47], BotGAD [45,46], Manadhata et al. [109], Exposure [35,36], Notos [28], Khalil et al. [82], Kopis [29], Huang and Greve [77], Yu et al. [169], Gao et al. [63], Yadav et al. [165,166], Smash [171], Segugio [133,134] ...
... Active DNS data collection. To actively obtain DNS data, a data collector would deliberately send DNS queries and record the corresponding DNS responses [44,73,89,91,105,121,127]. The list of queried domains is built thanks to multiple sources, typical ones include popular domains lists such as the Alexa Top Sites [25], domains appearing in various blacklists, or those from the zone iles of authoritative servers. ...
... The primary reason lies in the difficulty of making a set of common or comparable reference datasets publicly available. Although there are currently several publicly available DNS datasets, collected either passively (e.g., from Farsight [55]) or actively (e.g., Thales [91]), they cannot be used in many approaches, especially in those relying on client-side patterns [109,126]. It should also be noted that although some approaches may work on data collected both actively and passively (for instance, the one proposed by Khalil et al. [82], which relies on domain co-location information obtainable from both datasets), such a comparison has never been performed before. ...
... In exploring the DNS research methods, we have investigated various data analysis methods, which have been utilized to detect, model, and mitigate the aforementioned DNS threats. First, we describe two main DNS data collection methods utilized in the literature and the associated works, including passive DNS data (PDNS) [80,47,81,82,83,12,11,84,48,41,67,22,51] and Active DNS data (ADNS) [67,83,85,54,41,86]. Next, we categorize the research works based on the common DNS data analysis techniques that have been used in the literature, such as machine learning algorithms [80,17,46,53,22,54,55,56,87,88,89,76] and association analysis [59,82,90,54,91]. ...
... The high-level architecture of passive DNS measurement systems is shown in Figure 11. Thanks to the valuable information they collect, passive DNS databases have been considered an invaluable asset for cybersecurity researchers combating a wide range of threats such as malware, botnets, and malicious actors [80,47,81,82,83,12,11,84,48,41,67,22,51]. ...
Preprint
Full-text available
The domain name system (DNS) is one of the most important components of today's Internet, serving as the standard naming convention between human-readable domain names and machine-routable IP addresses of Internet resources. However, due to the vulnerability of DNS to various threats, its security and functionality have been continuously challenged over the course of time. Although researchers have addressed various aspects of the DNS in the literature, there are still many challenges yet to be addressed. In order to comprehensively understand the root causes of the vulnerabilities of DNS, it is essential to review the various activities of the research community on the DNS landscape. To this end, this paper surveys more than 170 peer-reviewed papers, published in both top conferences and journals in the last ten years, and summarizes vulnerabilities in DNS and corresponding countermeasures. This paper not only focuses on the DNS threat landscape and existing challenges, but also discusses the data analysis methods frequently used to address DNS threat vulnerabilities. Furthermore, we look into the DNS threat landscape from the viewpoint of the entities involved in the DNS infrastructure, in an attempt to point out the more vulnerable entities in the system.
... DNS data is one of the most notable sources of information utilized to detect malicious domains [Kountouras et al. 2016; Weimer 2005]. In general, there are two types of approaches that complement each other. ...
... For example, Farsight passive DNS data [Farsight Security, Inc. 2019] utilizes sensors deployed behind DNS resolvers and provides aggregate information about domain resolutions. Active DNS data is collected by periodically querying a large pre-compiled list of domains in the Internet (e.g., Kountouras et al. [2016]). Passive DNS data has been an invaluable source of information for detecting and mitigating malicious activities in the Internet [Antonakakis et al. 2012;J. ...
Article
Full-text available
Malicious domains, including phishing websites, spam servers, and command and control servers, are the reason for many of the cyber attacks nowadays. Thus, detecting them in a timely manner is important to not only identify cyber attacks but also take preventive measures. There has been a plethora of techniques proposed to detect malicious domains by analyzing Domain Name System (DNS) traffic data. Traditionally, DNS acts as an Internet miscreant’s best friend, but we observe that the subtle traces in DNS logs left by such miscreants can be used against them to detect malicious domains. Our approach is to build a set of domain graphs by connecting “related” domains together and injecting known malicious and benign domains into these graphs so that we can make inferences about the other domains in the domain graphs. A key challenge in building these graphs is how to accurately identify related domains so that incorrect associations are minimized and the number of domains connected from the dataset is maximized. Based on our observations, we first train two classifiers and then devise a set of association rules that assist in linking domains together. We perform an in-depth empirical analysis of the graphs built using these association rules on passive DNS data and show that our techniques can detect many more malicious domains than the state-of-the-art.
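The graph-inference idea described above can be illustrated with a deliberately simplified propagation over a domain graph. The paper's classifiers and association rules are replaced here by a plain neighbor expansion, and all names are hypothetical:

```python
from collections import defaultdict

def propagate(edges, seed_malicious, rounds=3):
    """Expand a set of known-malicious domains along graph edges: any domain
    connected to a suspicious one becomes suspicious. A toy stand-in for
    full graph inference."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    bad = set(seed_malicious)
    for _ in range(rounds):
        bad |= {n for d in list(bad) for n in adj[d]}
    return bad

# a-b-c form one connected component; x-y are unrelated and stay clean.
graph = [("a.example", "b.example"), ("b.example", "c.example"),
         ("x.example", "y.example")]
suspects = propagate(graph, {"a.example"})
```

The hard part, as the abstract notes, is deciding which edges to draw in the first place so that incorrect associations are minimized.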
... Originally designed to translate user-friendly domain names into IP addresses, the DNS nowadays is also (ab)used for many different purposes, such as operating blacklists of spam email senders, tunneling traffic, enabling cybercriminals to operate moving command and control infrastructures, etc. Inspecting DNS data has thus become an ideal way to monitor different aspects of Internet activity, including cyber security aspects. In fact, the security community has been using the DNS extensively to detect Internet abuses [2,3,6,15,22,24,28,53]. As mentioned earlier, He et al. [23] leverage DNS data to study usage patterns of the two cloud providers Amazon EC2 and Microsoft Azure. ...
... DNS data is usually collected in two ways: (1) by passively recording DNS queries and responses made to some collector DNS servers [51] or (2) by actively querying some domains [28,47]. While passive DNS data is considered easier to collect and is by far the most common type of DNS data, the two approaches are complementary for obtaining the best possible data coverage in terms of number of domains. ...
Conference Paper
Full-text available
Most websites, services, and applications have come to rely on Internet services (e.g., DNS, CDN, email, WWW, etc.) offered by third parties. Although employing such services generally improves reliability and cost-effectiveness, it also creates dependencies on service providers, which may expose websites to additional risks, such as DDoS attacks or cascading failures. As cloud services are becoming more popular, an increasing percentage of the overall Internet ecosystem relies on a decreasing number of highly popular services. In our general effort to assess the security risk for a given entity, and motivated by the effects of recent service disruptions, we perform a large-scale analysis of passive and active DNS datasets including more than 2.5 trillion queries in order to discover the dependencies between websites and Internet services. In this paper, we present the findings of our DNS dataset analysis, and attempt to expose important insights about the ecosystem of dependencies. To further understand the nature of dependencies, we perform graph-theoretic analysis on the dependency graph and propose support power, a novel power measure that can quantify the amount of dependence websites and other services have on a particular service. Our DNS analysis findings reveal that the current service ecosystem is dominated by a handful of popular service providers---with Amazon being the leader, by far---whose popularity is steadily increasing. These findings are further supported by our graph analysis results, which also reveal a set of less-popular services that many (regional) websites depend on.
... Active DNS data sets (collected by e.g., OpenINTEL [91]) rely on scanning zone files or popular domains to obtain these records, while passive DNS data sets (collected by e.g., Farsight Security [32]) extract them from monitored DNS responses. Both types of data sets have been used to detect malicious domain registrations and activity [19], [52], [84]. ...
... Passive DNS data collection may also have privacy implications [52], and requires sufficient storage and processing resources. Active DNS data collection has similar storage and resource needs, especially to ensure that records are updated sufficiently frequently. ...
... As an example, the connection between non-residential IPs and web services can be captured by the average number of TLD+3 domains per IP in the direct inetnum (§II). Intuitively, this feature describes the number of domains hosted in the direct inetnum of this IP, as found in the Active DNS dataset [68]. Our evaluation on the labeled set shows that non-residential IPs have 5.49 as the average feature value while residential IPs have only 0.016. ...
... 68.92% are labeled as malware sites, 29.97% being malicious sites and 2.24% being phishing sites). Examples include ntkrnlpa.cn, ...
Conference Paper
Full-text available
An emerging Internet business is residential proxy (RESIP) as a service, in which a provider utilizes hosts within residential networks (in contrast to those running in a datacenter) to relay their customers' traffic, in an attempt to avoid server-side blocking and detection. Despite the prominent roles these services could play in the underground business world, little has been done to understand whether they are indeed involved in cybercrimes and how they operate, due to the challenges in identifying their RESIPs, not to mention any in-depth analysis of them. In this paper, we report the first study on RESIPs, which sheds light on the behaviors and the ecosystem of these elusive gray services. Our research employed an infiltration framework, including our clients for RESIP services and the servers they visited, to detect 6 million RESIP IPs across 230+ countries and 52K+ ISPs. The observed addresses were analyzed and the hosts behind them were further fingerprinted using a new profiling system. Our effort led to several surprising findings about the RESIP services unknown before. Despite the providers' claim that the proxy hosts willingly join, many proxies run on likely compromised hosts, including IoT devices. Through cross-matching the hosts we discovered with labeled PUP (potentially unwanted programs) logs provided by a leading IT company, we uncovered various illicit operations RESIP hosts performed, including illegal promotion, fast fluxing, phishing, malware hosting, and others. We also reverse-engineered RESIP services' internal infrastructures, and uncovered their potential rebranding and reselling behaviors. Our research takes the first step toward understanding this new Internet service, contributing to the effective control of its security risks.
... However, if this is not the case in the future, it would be easy to find alternatives. Given that the important features we extract depend more on the temporal registration information than the contact details of the registrant, we could replace our WHOIS features with DNS tracking systems like Active DNS [23] or the Alembic system [25]. ...
Conference Paper
Full-text available
Modern malware typically makes use of a domain generation algorithm (DGA) to avoid command and control domains or IPs being seized or sinkholed. This means that an infected system may attempt to access many domains in an attempt to contact the command and control server. Therefore, the automatic detection of DGA domains is an important task, both for the sake of blocking malicious domains and identifying compromised hosts. However, many DGAs use English wordlists to generate plausibly clean-looking domain names; this makes automatic detection difficult. In this work, we devise a notion of difficulty for DGA families called the smashword score; this measures how much a DGA family looks like English words. We find that this measure accurately reflects how much a DGA family's domains look like they are made from natural English words. We then describe our new modeling approach, which is a combination of a novel recurrent neural network architecture with domain registration side information. Our experiments show the model is capable of effectively identifying domains generated by difficult DGA families. Our experiments also show that our model outperforms existing approaches, and is able to reliably detect difficult DGA families such as matsnu, suppobox, rovnix, and others. The model's performance compared to the state of the art is best for DGA families that resemble English words. We believe that this model could either be used in a standalone DGA domain detector---such as an endpoint security application---or alternately the model could be used as a part of a larger malware detection system.
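The exact smashword score is not reproduced in this excerpt, but the underlying idea of measuring how much of a domain label is covered by dictionary words can be sketched with a simple coverage heuristic. This is a hypothetical stand-in, not the published metric:

```python
def wordlikeness(label, wordlist):
    """Fraction of the label's characters covered by dictionary words of
    length >= 3 -- a crude proxy for 'looks made from English words'."""
    covered = [False] * len(label)
    for w in wordlist:
        if len(w) < 3:          # ignore very short words that match by chance
            continue
        start = label.find(w)
        while start != -1:
            for i in range(start, start + len(w)):
                covered[i] = True
            start = label.find(w, start + 1)
    return sum(covered) / len(label) if label else 0.0
```

Under such a measure, wordlist-based DGA families like suppobox or matsnu would score high, while random-looking labels score near zero, which is precisely why they are the hard cases for detectors.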
... There are many additional data sources which are now publicly available, but most are not suitable as-is for training and testing these systems. Some are more specialized datasets, such as the Active DNS Project [35], many contain only malicious traffic, and some data sources are both specialized and malicious, such as only containing peer-to-peer botnet command and control traffic. ...
Preprint
Metrics and frameworks to quantifiably assess security measures have arisen from needs of three distinct research communities - statistical measures from the intrusion detection and prevention literature, evaluation of cyber exercises, e.g., red-team and capture-the-flag competitions, and economic analyses addressing cost-versus-security tradeoffs. In this paper we provide two primary contributions to the security evaluation literature - a representative survey, and a novel framework for evaluating security that is flexible, applicable to all three use cases, and readily interpretable. In our survey of the literature we identify the distinct themes from each community's evaluation procedures side by side and flesh out the drawbacks and benefits of each. Next, we provide a framework for evaluating security by comprehensively modeling the resource, labor, and attack costs in dollars incurred based on quantities, accuracy metrics, and time. This framework is a more "holistic" approach in that it incorporates the accuracy and performance metrics, which dominate intrusion detection evaluation, the time to detection and impact to data and resources of an attack, favored by educational competitions' metrics, and the monetary cost of many essential security components used in financial analysis. Moreover, it is flexible enough to accommodate each use case, easily interpretable, and comprehensive in terms of costs considered. Finally, we provide two examples of the framework applied to real-world use cases. Overall, we provide a survey and a grounded, flexible framework and multiple concrete examples for evaluating security that addresses the needs of three, currently distinct communities.
... This is similar to [8], in which a dataset is constructed and features are determined both via typical measures such as IP address, number of transmitted packets, and ports, as well as more intensive methods such as a clustering of NetFlows. In [9], a comparative evaluation of different activation functions is performed on the UNSW-NB15 dataset. Various models are trained with different activation functions and configurations, including Tanh, Tanh with dropout, Rectifier, Rectifier with dropout, Maxout and Maxout with dropout. ...
... Several studies have focused on collecting DNS datasets. For example, Kountouras et al. [54] implemented a system, Thales, that actively queries and collects records for massive amounts of domain names distilled from multiple freely available sources. Pearce et al. [55] developed Iris, a scalable, accurate, and ethical system that measures global name resolution to detect DNS manipulation, tracking how domain names evolve over time. ...
Article
Full-text available
Some of the most serious security threats facing computer networks involve malware. To prevent this threat, administrators need to swiftly remove the infected machines from their networks. One common way to detect infected machines in a network is by monitoring communications based on blacklists. However, detection using this method has the following two problems: no blacklist is completely reliable, and blacklists do not provide sufficient evidence to allow administrators to determine the validity and accuracy of the detection results. Therefore, simply matching communications with blacklist entries is insufficient, and administrators should pursue their detection causes by investigating the communications themselves. In this paper, we propose an approach for classifying malicious DNS queries detected through blacklists by their causes. This approach is motivated by the following observation: a malware communication is divided into several transactions, each of which generates queries related to the malware; thus, surrounding queries that occur before and after a malicious query detected through blacklists help in estimating the cause of the malicious query. Our cause-based classification drastically reduces the number of malicious queries to be investigated because the investigation scope is limited to only representative queries in the classification results. In experiments, we have confirmed that our approach could group 388 malicious queries into 3 clusters, each consisting of queries with a common cause. These results indicate that administrators can briefly pursue all the causes by investigating only representative queries of each cluster, and thereby swiftly address the problem of infected machines in the network.
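The notion of "surrounding queries" above amounts to grouping DNS events into transactions by temporal proximity, so that queries occurring just before and after a blacklisted one can be inspected together. A minimal sketch (the 30-second gap and all data are illustrative):

```python
def sessionize(events, gap=30):
    """Group time-ordered (timestamp, qname) DNS events into transactions:
    a query within `gap` seconds of the previous one joins its transaction."""
    sessions = []
    last_ts = None
    for ts, qname in sorted(events):
        if last_ts is None or ts - last_ts > gap:
            sessions.append([])       # too long a silence: start a new one
        sessions[-1].append(qname)
        last_ts = ts
    return sessions

window = [(0, "blacklisted.example"), (5, "cdn.example"), (100, "mail.example")]
clusters = sessionize(window)
```

An administrator would then investigate only one representative query per cluster, which is the investigation-scope reduction the abstract describes.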
... [57] Active DNS Project: The freely available data contains more than a terabyte of unprocessed DNS packet captures (PCAPs) along with tens of gigabytes of de-duplicated DNS records per day. Thus, the active DNS datasets represent a significant portion of the world's daily DNS delegation hierarchy [94]. ...
Preprint
This survey focuses on intrusion detection systems (IDS) that leverage host-based data sources for detecting attacks on enterprise network. The host-based IDS (HIDS) literature is organized by the input data source, presenting targeted sub-surveys of HIDS research leveraging system logs, audit data, Windows Registry, file systems, and program analysis. While system calls are generally included in audit data, several publicly available system call datasets have spawned a flurry of IDS research on this topic, which merits a separate section. Similarly, a section surveying algorithmic developments that are applicable to HIDS but tested on network data sets is included, as this is a large and growing area of applicable literature. To accommodate current researchers, a supplementary section giving descriptions of publicly available datasets is included, outlining their characteristics and shortcomings when used for IDS evaluation. Related surveys are organized and described. All sections are accompanied by tables concisely organizing the literature and datasets discussed. Finally, challenges, trends, and broader observations are throughout the survey and in the conclusion along with future directions of IDS research.
... To identify such sinkholed domain names, we take two steps: we collect name server records for a target domain name and match the records with known sinkholing information. We actively send DNS queries to domain names to collect the corresponding name server records, in a similar way to a previous study (Kountouras et al., 2016). Then, we match the collected name server records with known name server records used in sinkholing operations. ...
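The matching step in the excerpt can be sketched as a set lookup over normalized NS records. The sinkhole name servers below are illustrative placeholders, not real operators; a deployment would load a curated list:

```python
# Hypothetical sinkhole NS names for illustration only.
KNOWN_SINKHOLE_NS = {"ns1.sinkhole.example", "ns2.sinkhole.example"}

def is_sinkholed(ns_records, known=frozenset(KNOWN_SINKHOLE_NS)):
    """Flag a domain when any of its (case/dot-normalized) NS records
    matches a name server known to be used in sinkholing operations."""
    return any(ns.lower().rstrip(".") in known for ns in ns_records)
```

The NS records themselves would come from active queries, as in the collection approach referenced above.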
Article
Full-text available
Since the 1980s, domain names and the domain name system (DNS) have been used and abused. Although legitimate Internet users rely on domain names as indispensable infrastructures for using the Internet, attackers use or abuse them as reliable, instantaneous, and distributed attack infrastructures. However, there is a lack of complete understanding of such domain-name abuses and methods for coping with them. In this study, we designed and implemented a unified analysis system combining current defense solutions to build actionable threat intelligence from malicious domain names. The basic concept underlying our system is malicious domain name chromatography. Our analysis system can distinguish among mixtures of malicious domain names for websites. On the basis of this concept, we do not create a hodgepodge of current solutions but design separation of abused domain names and offer actionable threat intelligence or defense information by considering the characteristics of malicious domain names as well as the possible defense solutions and points of defense. Finally, we evaluated our analysis system and defense-information output using a large real dataset to show the effectiveness and validity of our system.
... Active DNS: We also utilize an active DNS (ADNS) dataset, which we obtain daily from the Active DNS project [24]. Since the duration of this dataset is less than a year, it does not have a complete temporal overlap with our PDNS dataset. ...
Article
Full-text available
Domain squatting is a common adversarial practice where attackers register domain names that are purposefully similar to popular domains. In this work, we study a specific type of domain squatting called "combosquatting," in which attackers register domains that combine a popular trademark with one or more phrases (e.g., betterfacebook[.]com, youtube-live[.]com). We perform the first large-scale, empirical study of combosquatting by analyzing more than 468 billion DNS records---collected from passive and active DNS data sources over almost six years. We find that almost 60% of abusive combosquatting domains live for more than 1,000 days, and even worse, we observe increased activity associated with combosquatting year over year. Moreover, we show that combosquatting is used to perform a spectrum of different types of abuse including phishing, social engineering, affiliate abuse, trademark abuse, and even advanced persistent threats. Our results suggest that combosquatting is a real problem that requires increased scrutiny by the security community.
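A first-cut combosquatting check follows directly from the definition above: the domain's label embeds a trademark together with extra tokens, yet the domain is not the legitimate one. A deliberately simplified sketch (the legitimate-domain list is illustrative, and the study's detection is far more involved, excluding typosquats and handling multi-label names):

```python
def is_combosquat(domain, trademark,
                  legit=frozenset({"facebook.com", "youtube.com"})):
    """Flag domains whose first label embeds a trademark plus extra tokens
    (e.g. betterfacebook.com, youtube-live.com)."""
    name = domain.lower().rstrip(".")
    if name in legit:                 # the real brand domain is not a squat
        return False
    label = name.split(".")[0]
    return trademark in label and label != trademark
```

For example, `betterfacebook.com` and `youtube-live.com` are flagged, while `facebook.com` itself is not.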
... All these approaches are, however, reactive, becoming effective only once an attack is already taking place. Less studied in the literature, with the exception of [3], is the possibility of using DNS measurements to pro-actively identify domains set up for malicious use. ...
Conference Paper
The Domain Name System contains a wealth of information about the security, stability and health of the Internet. Most research that leverages the DNS for detection of malicious activities does so by using passive measurements. The limitation of this approach, however, is that it is effective only once an attack is ongoing. In this paper, we explore a different approach. We advocate the use of active DNS measurements for pro-active (i.e., before the actual attack) identification of domains set up for malicious use. Our research makes use of data from the OpenINTEL large-scale active DNS measurement platform, which, since February 2015, has collected daily snapshots of currently more than 60% of the DNS namespace. We illustrate the potential of our approach by showing preliminary results in three case studies, namely snowshoe spam, denial of service attacks and a case of targeted phishing known as CEO fraud.
... While there has been a lot of work on general DNS measurements [30], [31], transparency [32], operation [33], and security [34], there are only a few studies considering the newly introduced gTLDs. The launch of new gTLDs has expanded the set of top-level domains used on the global Internet. ...
Article
Full-text available
The centralized zone data service (CZDS) was introduced by the Internet Corporation for Assigned Names and Numbers (ICANN) to facilitate sharing of, and access to, zone data of the new generic Top-Level Domains (gTLDs). CZDS aims to improve the security and transparency of the Internet's naming system. In this paper, we investigate CZDS's transparency by measurement and evaluation. By requesting access to the zone data of all gTLDs listed in the CZDS portal, we analyze various aspects of CZDS, including access status, responsiveness, and the reasons provided for granting or denying access. Among other findings, we find that while a large percentage of gTLD administrators respond within a reasonable time, more than 10% of them have a long request-to-decision waiting time, and some requests go unanswered even six months after submission. Furthermore, we find denial cases with unjustified reasons, where administrators who denied a request asked for information that was already provided in the request form. We discuss the implications, and how to achieve better outcomes for CZDS using insights from our measurement and evaluation.
... A different approach to passive DNS is to actively query and collect large volumes of DNS data. This is known as active DNS, an approach recently proposed in [29]. The resulting project, known as Thales, offers researchers and practitioners in the security area a large, open-access DNS dataset that represents a significant portion of the world's daily DNS delegation hierarchy. ...
Article
Full-text available
The Domain Name System (DNS) is a critical infrastructure of any network and, not surprisingly, a common target of cybercrime. There are numerous works that analyse higher-level DNS traffic to detect anomalies in the DNS or any other network service. By contrast, few efforts have been made to study and protect the recursive DNS level. In this paper, we introduce a novel abstraction of recursive DNS traffic to detect a flooding attack, a kind of Distributed Denial of Service (DDoS). The crux of our abstraction lies in a simple observation: recursive DNS queries, from IP addresses to domain names, form social groups; hence, a DDoS attack should result in drastic changes to the DNS social structure. We have built an anomaly-based detection mechanism which, given a time window of DNS usage, makes use of features that attempt to capture the DNS social structure, including a heuristic that estimates group composition. Our detection mechanism has been successfully validated (in a simulated and controlled setting), and with it the suitability of our abstraction for detecting flooding attacks. To the best of our knowledge, this is the first work to successfully use this abstraction to detect these kinds of attacks at the recursive level. Before concluding the paper, we motivate further research directions based on this new abstraction: we have designed and tested two additional experiments, which exhibit promising results in detecting other types of anomalies in recursive DNS servers.
... The Active DNS Project [225] is currently collecting A records of about 300M domains derived from 1.3K zone files on a daily basis. In addition to this effort, Rapid7 [12] also conducts active DNS measurements at a large scale and offers researchers access to its data. ...
Thesis
With the Internet having become an indispensable means of communication in modern society, censorship and surveillance in cyberspace are getting more prevalent. Malicious actors around the world, ranging from nation states to private organizations, are increasingly making use of technologies to not only control the free flow of information, but also eavesdrop on Internet users' online activities. Internet censorship and online surveillance have led to severe human rights violations, including the freedom of expression, the right to information, and privacy. In this dissertation, we present two related lines of research that seek to tackle the twin problems of Internet censorship and online surveillance via an empirical lens. We show that empirical network measurement, when conducted at scale and in a longitudinal manner, is an essential approach to gain insights into (1) censors' blocking behaviors and (2) key characteristics of anti-censorship and privacy-enhancing technologies. These insights can then be used to not only aid in the development of effective censorship circumvention tools, but also help related stakeholders making informed decisions to maximize the privacy benefits of privacy-enhancing technologies. With a focus on measuring Internet censorship, we first conduct an empirical study of the I2P anonymity network, shedding light on important properties of the network and its censorship resistance. By measuring the state of I2P censorship around the globe, we then expose numerous censorship regimes (e.g., China, Iran, Oman, Qatar, and Kuwait) where I2P are blocked by various techniques. As a result of this work, I2P has adopted DNS over HTTPS, which is one of the domain name encryption protocols introduced recently, to prevent passive snooping and make the bootstrapping process more resistant to DNS-based network filtering and surveillance. 
Of the censors discovered above, we find that China is the most sophisticated one, having developed an advanced network filtering system known as the Great Firewall (GFW). Continuing the same line of work, we have developed GFWatch, a large-scale, longitudinal measurement platform capable of testing hundreds of millions of domains daily, enabling continuous monitoring of the DNS filtering behavior of China's GFW. Data collected by GFWatch not only casts new light on technical observations, but also informs the public in a timely manner about changes in the GFW's blocking policy and assists other detection and circumvention efforts. We then focus on measuring and improving the privacy benefits provided by domain name encryption technologies, such as DNS over TLS (DoT), DNS over HTTPS (DoH), and Encrypted Client Hello (ECH). Although the security benefits of these technologies are clear, their positive impact on user privacy is weakened by the still-exposed IP address information. We assess the privacy benefits of these new technologies by considering the relationship between hostnames and their hosting IP addresses. We show that encryption alone is not enough to protect web users' privacy. Especially when it comes to preventing nosy network observers from tracking users' browsing activities, the IP address information of the remote servers being contacted is still visible and can be employed to infer the visited websites. Our findings help raise awareness of the remaining effort that must be undertaken by related stakeholders (i.e., website owners and hosting providers) to ensure a meaningful privacy benefit from the universal deployment of domain name encryption technologies. Nevertheless, the benefits provided by DoT/DoH against threats ``under the recursive resolver'' come with the cost of trusting the DoT/DoH operator with the entire web browsing history of users.
As a step towards mitigating the privacy concerns stemming from the exposure of all DNS resolutions of a user (effectively the user's entire domain-level browsing history) to an additional third-party entity, we proposed K-resolver, a resolution mechanism in which DNS queries are dispersed across multiple (K) DoH servers, allowing each of them to individually learn only a fraction (1/K) of a user's browsing history. Our experimental results show that our approach incurs negligible overhead while improving user privacy. Last but not least, given that visibility into plaintext domain information is lost with the introduction of domain name encryption protocols, it is important to investigate whether and how the network traffic of these protocols is interfered with by different Internet filtering systems. We created DNEye, a measurement system built on top of a network of distributed vantage points, which we used to study the accessibility of DoT/DoH and ESNI, and to investigate whether these protocols are tampered with by network providers (e.g., for censorship). We find evidence of blocking efforts against domain name encryption technologies in several countries, including China, Russia, and Saudi Arabia. On the bright side, we discover that domain name encryption can help unblock more than 55% and 95% of censored domains in China and in other countries where DNS-based filtering is heavily employed, respectively.
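The K-resolver dispersal can be sketched as deterministic hashing of the queried name onto one of K configured DoH servers, so that each server observes a stable 1/K slice of the browsing history. This is a simplified sketch, not the authors' implementation; the two-label cut used below to approximate the registered domain is an assumption:

```python
import hashlib

def pick_resolver(domain, resolvers):
    """Map a domain to one of K resolvers deterministically, so repeated
    queries for the same name always go to the same server and each
    server learns only its own share of the browsing history."""
    # Hash the (naively approximated) registered domain rather than the
    # full query name, so www.example.com and mail.example.com land on
    # the same resolver and cannot be correlated across servers.
    base = ".".join(domain.lower().rstrip(".").split(".")[-2:])
    h = int(hashlib.sha256(base.encode()).hexdigest(), 16)
    return resolvers[h % len(resolvers)]
```

Keeping all subdomains of a site on one resolver prevents any single server from joining observations that another server made about the same site.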
... Another approach argued that a Cryptography-based Prefix preserving Anonymization algorithm [37] or other encryption techniques that would secure the IP prefix [38] should be employed. At the other end of the spectrum, an entirely different solution was proposed: the collection of active DNS data [39]. This was made possible by creating a system called Thales which can systematically query and collect large volumes of active DNS data using as input an aggregation of publicly accessible sources of domain names and URLs that have been collected for several years by the research team. ...
Preprint
Full-text available
The Domain Name System (DNS) was created to resolve the IP addresses of the web servers to easily remembered names. When it was initially created, security was not a major concern; nowadays, this lack of inherent security and trust has exposed the global DNS infrastructure to malicious actors. The passive DNS data collection process creates a database containing various DNS data elements, some of which are personal and need to be protected to preserve the privacy of the end users. To this end, we propose the use of distributed ledger technology. We use Hyperledger Fabric to create a permissioned blockchain, which only authorized entities can access. The proposed solution supports queries for storing and retrieving data from the blockchain ledger, allowing the use of the passive DNS database for further analysis, e.g. for the identification of malicious domain names. Additionally, it effectively protects the DNS personal data from unauthorized entities, including the administrators that can act as potential malicious insiders, and allows only the data owners to perform queries over these data. We evaluated our proposed solution by creating a proof-of-concept experimental setup that passively collects DNS data from a network and then uses the distributed ledger technology to store the data in an immutable ledger, thus providing a full historical overview of all the records.
... Several projects have released datasets as part of their results, and several studies have focused on the accumulation of datasets themselves. For example, Kountouras et al. [52] implemented a system called Thales, which actively queries and collects DNS records for massive numbers of domain names distilled from multiple freely available sources, while Pearce et al. [53] developed a scalable, accurate, and ethical system, called Iris, which measures global name resolution and uses active manipulation to track the trends of domain names that evolve over time. In addition, Viglianisi et al. [54] referred to SysTaint, which facilitates reverse engineering of malware communications. ...
Article
Full-text available
Computer networks are facing serious threats from the emergence of malware with sophisticated DGAs (Domain Generation Algorithms). This type of DGA malware dynamically generates domain names by concatenating words from dictionaries for evading detection. In this paper, we propose an approach for identifying the callback communications of such dictionary-based DGA malware by analyzing their domain names at the word level. This approach is based on the following observations: These malware families use their own dictionaries and algorithms to generate domain names, and accordingly, the word usages of malware-generated domains are distinctly different from those of human-generated domains. Our evaluation indicates that the proposed approach is capable of achieving accuracy, recall, and precision as high as 0.9989, 0.9977, and 0.9869, respectively, when used with labeled datasets. We also clarify the functional differences between our approach and other published methods via qualitative comparisons. Taken together, these results suggest that malware-infected machines can be identified and removed from networks using DNS queries for detected malicious domain names as triggers. Our approach contributes to dramatically improving network security by providing a technique to address various types of malware encroachment.
... The dataset for this exploration comes from ActiveDNS [61]. This project out of Georgia Tech performs daily active DNS lookups for millions of domain names and records the query results in a database. ...
Preprint
We study hypergraph visualization via its topological simplification. We explore both vertex simplification and hyperedge simplification of hypergraphs using tools from topological data analysis. In particular, we transform a hypergraph to its graph representations known as the line graph and clique expansion. A topological simplification of such a graph representation induces a simplification of the hypergraph. In simplifying a hypergraph, we allow vertices to be combined if they belong to almost the same set of hyperedges, and hyperedges to be merged if they share almost the same set of vertices. Our proposed approaches are general, mathematically justifiable, and they put vertex simplification and hyperedge simplification in a unifying framework.
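The clique expansion and the vertex-combination rule above can be sketched in plain Python. This sketch covers only the exact-membership special case of the paper's "almost the same set of hyperedges" criterion (zero tolerance), not its topological machinery:

```python
from itertools import combinations

def clique_expansion(hyperedges):
    """Graph on the hypergraph's vertices, with an edge between every
    pair of vertices that co-occur in at least one hyperedge."""
    edges = set()
    for members in hyperedges.values():
        for u, v in combinations(sorted(members), 2):
            edges.add((u, v))
    return edges

def merge_identical_vertices(hyperedges):
    """Group vertices that belong to exactly the same set of hyperedges
    (the degenerate case of 'almost the same' membership)."""
    membership = {}
    for e, members in hyperedges.items():
        for v in members:
            membership.setdefault(v, set()).add(e)
    groups = {}
    for v, edges_of_v in membership.items():
        groups.setdefault(frozenset(edges_of_v), []).append(v)
    return [sorted(g) for g in groups.values()]
```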
... Several studies have focused specifically on collecting DNS datasets that alleviate the bias caused by various factors. For example, Kountouras et al. [36] implemented a system called Thales, which actively queries and collects DNS records for massive numbers of domain names distilled from multiple freely available sources. Pearce et al. [37] developed a scalable, accurate, and ethical system, called Iris, that measures global name resolution and uses active manipulation to track the trends of domain names that evolve over time. ...
Article
Some of the most serious security threats facing computer networks involve malware. To prevent malware-related damage, administrators must swiftly identify and remove the infected machines that may reside in their networks. However, many malware families have domain generation algorithms (DGAs) to avoid detection. A DGA is a technique in which the domain name is changed frequently to hide the callback communication from the infected machine to the command-and-control server. In this article, we propose an approach for estimating the randomness of domain names by superficially analyzing their character strings. This approach is based on the following observations: human-generated benign domain names tend to reflect the intent of their domain registrants, such as an organization, product, or content. In contrast, dynamically generated malicious domain names consist of meaningless character strings because conflicts with already registered domain names must be avoided; hence, there are discernible differences in the strings of dynamically generated and human-generated domain names. Notably, our approach does not require any prior knowledge about DGAs. Our evaluation indicates that the proposed approach is capable of achieving recall and precision as high as 0.9960 and 0.9029, respectively, when used with labeled datasets. Additionally, this approach has proven to be highly effective for datasets collected via a campus network. Thus, these results suggest that malware-infected machines can be swiftly identified and removed from networks using DNS queries for detected malicious domains as triggers.
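The superficial string analysis described above can be approximated with character-level Shannon entropy over the second-level label. The entropy measure and the threshold below are illustrative stand-ins, not the paper's actual method or fitted parameters:

```python
import math
from collections import Counter

def char_entropy(label):
    """Shannon entropy (bits per character) of a label's characters.
    Human-chosen labels reuse few characters; DGA output tends to
    spread probability mass over many characters."""
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_generated(domain, threshold=3.5):
    """Crude randomness flag on the second-level label; the threshold
    is an assumed placeholder, not a fitted parameter."""
    label = domain.lower().rstrip(".").split(".")[-2]
    return char_entropy(label) > threshold
```

Real detectors combine several such string statistics (n-gram likelihood, vowel ratio, length) rather than a single entropy cut.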
... Passive DNS data provides a summarized view of domain queries. Experiments have shown that active DNS data provides more kinds of records, and passive DNS data provides a tighter connection graph [22]. Passive DNS can provide richer information than active DNS, but due to privacy issues and the location of the deployed sensors, the collected data has certain limitations. ...
Chapter
The Domain Name System (DNS), as a foundation of the Internet, has been widely exploited by cybercriminals. Many malicious domain detection methods have achieved significant success in the past decades. However, existing detection methods usually use classification-based and association-based representations, which are not capable of dealing with the imbalance between malicious and benign domains. To solve this problem, we propose a novel domain detection system named KSDom. KSDom designs a data collector to collect a large amount of DNS traffic data and rich external DNS-related data, then employs the K-means and SMOTE methods to handle the imbalanced data. Finally, KSDom uses the Categorical Boosting (CatBoost) algorithm to identify malicious domains. Comprehensive experimental results clearly show the effectiveness of our KSDom system and prove its good robustness on imbalanced datasets with different ratios. KSDom retains high accuracy even on extremely imbalanced DNS traffic.
Chapter
The Domain Name System is a critical piece of infrastructure that has expanded into use cases beyond its original intent. DNS TXT records are intentionally very permissive in what information can be stored there, and as a result are often used in broad and undocumented ways to support Internet security and networked applications. In this paper, we identified and categorized the patterns in TXT record use from a representative collection of resource record sets. We obtained the records from a data set containing 1.4 billion TXT records collected over a 2-year period and used pattern matching to identify record use cases present across multiple domains. We found that 92% of these records generally fall into three categories: protocol enhancement, domain verification, and resource location. While some of these records are required to remain public, we discovered many examples that unnecessarily reveal domain information or present other security threats (e.g., amplification attacks) in conflict with security best practices.
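The pattern-matching categorization can be sketched with a few widely used TXT conventions (SPF, DKIM, DMARC, and site-verification tokens). The pattern list is a small illustrative subset, not the taxonomy the chapter actually derives:

```python
import re

# Illustrative patterns for common TXT record uses; real-world
# taxonomies are far larger than this handful.
PATTERNS = [
    (re.compile(r"^v=spf1\b"), "protocol enhancement"),   # SPF policy
    (re.compile(r"^v=DKIM1\b"), "protocol enhancement"),  # DKIM public key
    (re.compile(r"^v=DMARC1\b"), "protocol enhancement"), # DMARC policy
    (re.compile(r"-site-verification="), "domain verification"),
    (re.compile(r"^MS=ms\d+", re.I), "domain verification"),  # Office 365
]

def categorize_txt(record):
    """Assign a TXT record string to one of the coarse use-case buckets."""
    for pattern, category in PATTERNS:
        if pattern.search(record):
            return category
    return "other"
```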
Chapter
Most desktop applications use the network, and insecure communications can have a significant impact on the application, the system, the user, and the enterprise. Understanding at scale whether desktop applications use the network securely is a challenge, because the application provenance of a given network packet is rarely available at centralized collection points. In this paper, we collect flow data from 39,758 MacOS devices on an enterprise network to study the network behaviors of individual applications. We collect flows locally on-device and can definitively identify the application responsible for every flow. We also develop techniques to distinguish “endogenous” flows common to most executions of a program from “exogenous” flows likely caused by unique inputs. We find that popular MacOS applications are in fact using the network securely, with 95.62% of the applications we study using HTTPS. Notably, we observe that security-sensitive services (including certificate management and mobile device management) do not use ports associated with secure communications. Our study provides important insights for users, device and network administrators, and researchers interested in secure communication.
Article
This survey focuses on intrusion detection systems (IDS) that leverage host-based data sources for detecting attacks on enterprise networks. The host-based IDS (HIDS) literature is organized by input data source, presenting targeted sub-surveys of HIDS research leveraging system logs, audit data, the Windows Registry, file systems, and program analysis. While system calls are generally included in audit data, several publicly available system call datasets have spawned a flurry of IDS research on this topic, which merits a separate section. To accommodate current researchers, a section giving descriptions of publicly available datasets is included, outlining their characteristics and shortcomings when used for IDS evaluation. Related surveys are organized and described. All sections are accompanied by tables concisely organizing the literature and datasets discussed. Finally, challenges, trends, and broader observations are discussed throughout the survey and in the conclusion, along with future directions of IDS research. Overall, this survey was designed to allow easy access to the diverse types of data available on a host for sensing intrusion, the progression of research using each, and the accessible datasets for prototyping in the area.
Article
Malicious websites often mimic top brands to host malware and launch social engineering attacks, e.g., to collect user credentials. Some such sites often attempt to hide malicious content from search engine crawlers (e.g., Googlebot), but show harmful content to users/client browsers, a technique known as cloaking. Past studies uncovered various aspects of cloaking using selected categories of websites (e.g., mimicking specific types of malicious sites). We focus on understanding cloaking behaviors using a broader set of websites. As a way forward, we built a crawler to automatically browse and analyze content from 100000 squatting (mostly malicious) domains, i.e., domains generated through typo-squatting and combo-squatting of 2883 popular websites. We use a headless Chrome browser and a search-engine crawler with user-agent modifications to identify cloaking behaviors, a challenging task due to dynamic content served at random; e.g., consecutive requests may serve very different malicious or benign content. Most malicious sites (e.g., phishing and malware) go undetected by current blacklists; only a fraction of cloaked sites (127, 3.3%) are flagged as malicious by VirusTotal. In contrast, we identify 80% of cloaked sites as malicious via a semi-automated process implemented by extending the content categorization functionality of Symantec’s SiteReview tool. Even after 3 months of observation, nearly half (1024, 45.4%) of the cloaked sites remained active, and only a few (31, 3%) of them are flagged by VirusTotal. This clearly indicates that existing blacklists are ineffective against cloaked malicious sites. Our techniques can serve as a starting point for more effective and scalable early detection of cloaked malicious sites.
Article
Full-text available
In this paper, we introduce DRIFT, a system for detecting command and control (C2) domain names in Internet of Things (IoT)–scale botnets. Using an intrinsic feature of malicious domain name queries prior to their registration (perhaps due to clock drift), we devise a difference-based lightweight feature for malicious C2 domain name detection. Using the NXDomain queries and responses of a popular malware, we establish the effectiveness of our detector with 99% accuracy, as early as more than 48 hours before the domains are registered. Our technique serves as a detection tool where other techniques relying on entropy or on reversing domain generation algorithms are impractical.
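The pre-registration signal DRIFT relies on can be sketched as a timestamp comparison between the first observed NXDomain query for a name and its later registration time; the margin parameter is an illustrative assumption, not a value from the paper:

```python
from datetime import datetime, timedelta

def queried_before_registration(first_nx_query, registered_at,
                                margin=timedelta(hours=1)):
    """Flag a domain whose first NXDomain query was observed well before
    its registration time -- the clock-drift signal DRIFT exploits for
    early C2 domain detection."""
    return registered_at - first_nx_query >= margin
```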
Chapter
We present DNS Unchained, a new application-layer DoS attack against core DNS infrastructure that for the first time uses amplification. To achieve an attack amplification of 8.51, we carefully chain CNAME records and force resolvers to perform deep name resolutions—effectively overloading a target authoritative name server with valid requests. We identify 178 508 potential amplifiers, of which 74.3% can be abused in such an attack due to the way they cache records with low Time-to-Live values. In essence, this allows a single modern consumer uplink to downgrade availability of large DNS setups. To tackle this new threat, we conclude with an overview of countermeasures and suggestions for DNS servers to limit the impact of DNS chaining attacks.
Article
The Domain Name System (DNS) is one of the most important components of today’s Internet, and is the standard naming convention between human-readable domain names and machine-routable Internet Protocol (IP) addresses of Internet resources. However, due to the vulnerability of DNS to various threats, its security and functionality have been continuously challenged over the course of time. Although researchers have addressed various aspects of DNS in the literature, there are still many challenges yet to be addressed. In order to comprehensively understand the root causes of the vulnerabilities of DNS, it is necessary to review the various activities of the research community on the DNS landscape. To this end, this paper surveys more than 170 peer-reviewed papers published in both top conferences and journals in the last ten years, and summarizes the vulnerabilities in DNS and the corresponding countermeasures. This paper not only focuses on the DNS threat landscape and existing challenges, but also discusses the data analysis methods that are frequently used to address DNS threat vulnerabilities. Furthermore, we look into the DNS threat landscape from the viewpoint of the entities involved in the DNS infrastructure, in an attempt to point out the more vulnerable entities in the system.
Article
The Domain Name System (DNS) plays a crucial role in connecting services and users on the Internet. Since its first specification, DNS has been extended in numerous documents to keep it fit for today’s challenges and demands. And these challenges are many. Revelations of snooping on DNS traffic led to changes to guarantee the confidentiality of DNS queries. Attacks to forge DNS traffic led to changes to shore up the integrity of the DNS. Finally, denial-of-service attacks on DNS operations have led to new DNS operations architectures. All of these developments make DNS a highly interesting, but also highly challenging, research topic. This tutorial, aimed at graduate students and early-career researchers, provides an overview of the modern DNS, its ongoing development, and its open challenges. The tutorial makes four major contributions. We first provide a comprehensive overview of the DNS protocol. Then, we explain how DNS is deployed in practice. This lays the foundation for the third contribution: a review of the biggest challenges the modern DNS faces today and how they can be addressed. These challenges are (i) protecting the confidentiality and (ii) guaranteeing the integrity of the information provided in the DNS, (iii) ensuring the availability of the DNS infrastructure, and (iv) detecting and preventing attacks that make use of the DNS. Last, we discuss which challenges remain open, pointing the reader towards new research areas.
Conference Paper
Full-text available
Fast Internet-wide scanning has opened new avenues for security research, ranging from uncovering widespread vulnerabilities in random number generators to tracking the evolving impact of Heartbleed. However, this technique still requires significant effort: even simple questions, such as, "What models of embedded devices prefer CBC ciphers?", require developing an application scanner, manually identifying and tagging devices, negotiating with network administrators, and responding to abuse complaints. In this paper, we introduce Censys, a public search engine and data processing facility backed by data collected from ongoing Internet-wide scans. Designed to help researchers answer security-related questions, Censys supports full-text searches on protocol banners and querying a wide range of derived fields (e.g., 443.https.cipher). It can identify specific vulnerable devices and networks and generate statistical reports on broad usage patterns and trends. Censys returns these results in sub-second time, dramatically reducing the effort of understanding the hosts that comprise the Internet. We present the search engine architecture and experimentally evaluate its performance. We also explore Censys's applications and show how questions asked in recent studies become simple to answer.
Conference Paper
Full-text available
Many botnet detection systems employ a blacklist of known command and control (C&C) domains to detect bots and block their traffic. Similar to signature-based virus detection, such a botnet detection approach is static, because the blacklist is updated only after running an external (and often manual) process of domain discovery. As a response, botmasters have begun employing domain generation algorithms (DGAs) to dynamically produce a large number of random domain names and select a small subset for actual C&C use. That is, a C&C domain is randomly generated and used for a very short period of time, thus rendering detection approaches that rely on static domain lists ineffective. Naturally, if we know how a domain generation algorithm works, we can generate the domains ahead of time and still identify and block botnet C&C traffic. The existing solutions are largely based on reverse engineering of the bot malware executables, which is not always feasible. In this paper we present a new technique to detect randomly generated domains without reversing. Our insight is that most of the DGA-generated (random) domains that a bot queries would result in Non-Existent Domain (NXDomain) responses, and that bots from the same botnet (with the same DGA algorithm) would generate similar NXDomain traffic. Our approach uses a combination of clustering and classification algorithms. The clustering algorithm clusters domains based on the similarity in the make-ups of domain names as well as the groups of machines that queried these domains. The classification algorithm is used to assign the generated clusters to models of known DGAs. If a cluster cannot be assigned to a known model, then a new model is produced, indicating a new DGA variant or family. We implemented a prototype system and evaluated it on real-world DNS traffic obtained from large ISPs in North America. We report the discovery of twelve DGAs. Half of them are variants of known (botnet) DGAs, and the other half are brand new DGAs that have never been reported before.
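The first clustering step (grouping NXDomains by the make-up of their names) can be crudely approximated by bucketing on a structural profile. The profile features below are illustrative simplifications of the richer statistical features and machine grouping the system actually uses:

```python
import string

def profile(domain):
    """Structural profile of an NXDomain: TLD, label-length bucket,
    and which character classes the second-level label uses."""
    parts = domain.lower().rstrip(".").split(".")
    label, tld = parts[-2], parts[-1]
    classes = (
        any(c in string.ascii_lowercase for c in label),  # letters
        any(c in string.digits for c in label),           # digits
        "-" in label,                                     # hyphens
    )
    return (tld, len(label) // 4, classes)

def group_nxdomains(nxdomains):
    """Bucket NXDomains with the same profile; large homogeneous
    buckets are candidate DGA families."""
    groups = {}
    for d in nxdomains:
        groups.setdefault(profile(d), []).append(d)
    return groups
```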
Conference Paper
Full-text available
In recent years, Internet miscreants have been leveraging the DNS to build malicious network infrastructures for malware command and control. In this paper we propose a novel detection system called Kopis for detecting malware-related domain names. Kopis passively monitors DNS traffic at the upper levels of the DNS hierarchy, and is able to accurately detect malware domains by analyzing global DNS query resolution patterns. Compared to previous DNS reputation systems such as Notos [3] and Exposure [4], which rely on monitoring traffic from local recursive DNS servers, Kopis offers a new vantage point and introduces new traffic features specifically chosen to leverage the global visibility obtained by monitoring network traffic at the upper DNS hierarchy. Unlike previous work, Kopis enables DNS operators to independently (i.e., without the need of data from other networks) detect malware domains within their authority, so that action can be taken to stop the abuse. Moreover, unlike previous work, Kopis can detect malware domains even when no IP reputation information is available. We developed a proof-of-concept version of Kopis and experimented with eight months of real-world data. Our experimental results show that Kopis can achieve high detection rates (e.g., 98.4%) and low false positive rates (e.g., 0.3% or 0.5%). In addition, Kopis is able to detect new malware domains days or even weeks before they appear in public blacklists and security forums, and it allowed us to discover the rise of a previously unknown DDoS botnet based in China.
Conference Paper
Full-text available
We present the first empirical study of fast-flux service networks (FFSNs), a newly emerging and still not widely known phenomenon on the Internet. FFSNs employ DNS to establish a proxy network on compromised machines through which illegal online services can be hosted with very high availability. Through our measurements we show that the threat which FFSNs pose is significant: FFSNs occur on a worldwide scale and already host a substantial percentage of online scams. Based on an analysis of the principles of FFSNs, we develop a metric with which FFSNs can be effectively detected. Building on our detection technique, we also discuss possible mitigation strategies.
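The paper's detection metric is a linear flux score over the number of distinct A records and autonomous systems observed for a name across repeated lookups. The weights and threshold below are illustrative placeholders, not the fitted values from the paper:

```python
def flux_score(n_addresses, n_asns, w_addr=1.0, w_asn=10.0):
    """Linear fast-flux score: many distinct IPs spread over many
    autonomous systems is the FFSN signature.  The weights here are
    illustrative assumptions, not fitted parameters."""
    return w_addr * n_addresses + w_asn * n_asns

def is_fast_flux(n_addresses, n_asns, threshold=50.0):
    # Threshold is an assumed placeholder for demonstration only.
    return flux_score(n_addresses, n_asns) > threshold
```

A benign CDN name may also return many A records, but they typically sit in few autonomous systems, which is why the ASN count carries the larger weight.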
Conference Paper
Full-text available
Researchers have recently noted (14; 27) the potential of fast poisoning attacks against DNS servers, which allows attackers to easily manipulate records in open recursive DNS resolvers. A vendor-wide upgrade mitigated but did not eliminate this attack. Further, existing DNS protection systems, including bailiwick-checking (12) and IDS-style filtration, do not stop this type of DNS poisoning. We therefore propose Anax, a DNS protection system that detects poisoned records in cache. Our system can observe changes in cached DNS records, and applies machine learning to classify these updates as malicious or benign. We describe our classification features and machine learning model selection process while noting that the proposed approach is easily integrated into existing local network protection systems. To evaluate Anax, we studied cache changes in a geographically diverse set of 300,000 open recursive DNS servers (ORDNSs) over an eight month period. Using hand-verified data as ground truth, evaluation of Anax showed a very low false positive rate (0.6% of all new resource records) and a high detection rate (91.9%).
Conference Paper
In this paper we study the structure of criminal networks, groups of related malicious infrastructures that work in concert to provide hosting for criminal activities. We develop a method to construct a graph of relationships between malicious hosts and identify the underlying criminal networks, using historic assignments in the DNS. We also develop methods to analyze these networks to identify general structural trends and devise strategies for effective remediation through takedowns. We then apply these graph construction and analysis algorithms to study the general threat landscape, as well as four cases of sophisticated criminal networks. Our results indicate that in many cases, criminal networks can be taken down by de-registering as few as five domain names, removing critical communication links. In cases of sophisticated criminal networks, we show that our analysis techniques can identify hosts that are critical to the network’s functionality and estimate the impact of performing network takedowns in remediating the threats. In one case, disabling 20% of a criminal network’s hosts would reduce the overall volume of successful DNS lookups to the criminal network by as much as 70%. This measure can be interpreted as an estimate of the decrease in the number of potential victims reaching the criminal network that would be caused by such a takedown strategy.
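The takedown-impact estimate can be sketched by deleting the highest-degree hosts from an infrastructure graph and measuring how much of the previously reachable network survives. The degree-based choice of targets is a simplification of the paper's criticality analysis, and the graph here is a toy adjacency map:

```python
from collections import deque

def reachable(adj, sources, removed):
    """Nodes reachable from `sources` by BFS after deleting `removed`."""
    seen = set(s for s in sources if s not in removed)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def takedown_impact(adj, sources, k):
    """Remove the k highest-degree nodes and report the fraction of
    previously reachable nodes that become unreachable."""
    before = reachable(adj, sources, set())
    by_degree = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    removed = set(by_degree[:k])
    after = reachable(adj, sources, removed)
    return 1 - len(after) / len(before)
```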
Conference Paper
In this paper, we present an analysis of a new class of domain names: disposable domains. We observe that popular web applications, along with other Internet services, systematically use this new class of domain names. Disposable domains are likely generated automatically, characterized by a 'one-time use' pattern, and appear to be used as a way of 'signaling' via DNS queries. To shed light on the pervasiveness of disposable domains, we study 24 days of live DNS traffic spanning a year observed at a large Internet Service Provider. We find that disposable domains increased from 23.1% to 27.6% of all queried domains, and from 27.6% to 37.2% of all resolved domains observed daily. While this creative use of DNS may enable new applications, it may also have unanticipated negative consequences on the DNS caching infrastructure, DNSSEC validating resolvers, and passive DNS data collection systems.
Article
Overview: The domain name system (abbreviated 'DNS') provides a distributed database that maps domain names to record sets (for example, IP addresses). DNS is one of the core protocol suites of the Internet. Yet DNS data is often volatile, and there are many unwanted records present in the domain name system. This paper presents a technology, called passive DNS replication, to obtain domain name system data from production networks and store it in a database for later reference. The paper is structured as follows: • Section 1 briefly recalls a few DNS-related terms used throughout this paper. • Section 2 motivates the need for passive DNS replication: DNS itself does not allow certain queries whose results are interesting in various contexts (mostly related to response to security incidents). • Section 3 describes the architecture of the dnslogger software, an implementation of passive DNS replication. • Section 4 documents successful applications of the technology.
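The key query a passive-DNS database enables, which DNS itself cannot answer, is the inverse mapping: "which names have pointed at this IP?" A minimal sketch of such a store (the schema and function names are illustrative, not dnslogger's actual design):

```python
import sqlite3

# Minimal passive-DNS store in the spirit of dnslogger (schema is illustrative).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE pdns (
    qname TEXT, rtype TEXT, rdata TEXT,
    first_seen INTEGER, last_seen INTEGER, count INTEGER,
    PRIMARY KEY (qname, rtype, rdata))""")

def record(qname, rtype, rdata, ts):
    """Upsert one observed answer record, tracking first/last seen times."""
    cur = db.execute(
        """UPDATE pdns SET last_seen=?, count=count+1
           WHERE qname=? AND rtype=? AND rdata=?""",
        (ts, qname, rtype, rdata))
    if cur.rowcount == 0:  # first sighting of this (name, type, data) triple
        db.execute("INSERT INTO pdns VALUES (?,?,?,?,?,1)",
                   (qname, rtype, rdata, ts, ts))

def domains_for_ip(ip):
    """The 'inverse' query plain DNS cannot answer: which names used this IP?"""
    rows = db.execute(
        "SELECT qname FROM pdns WHERE rtype='A' AND rdata=?", (ip,))
    return sorted(r[0] for r in rows)

record("bad.example", "A", "203.0.113.5", ts=100)
record("evil.example", "A", "203.0.113.5", ts=200)
record("bad.example", "A", "203.0.113.5", ts=300)  # repeat sighting -> upsert
names = domains_for_ip("203.0.113.5")  # ['bad.example', 'evil.example']
```

The first/last-seen timestamps are what make the data useful for incident response: they let an analyst reconstruct when a name-to-IP binding was live, even after the live DNS record has changed.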
Article
Botnet threats, such as server attacks or spam email, have been increasing. Using a blacklist of domain names has been proposed as a way to find infected hosts. However, this method may not find all infected hosts, because a blacklist does not cover all black domain names. In this paper, we present a method for finding unknown black domain names and extending the blacklist by using DNS traffic data and an original blacklist of known black domain names. We use the co-occurrence relation of two different domain names to find unknown black domain names and extend the blacklist: if a domain name frequently co-occurs with a known black name, we assume that the domain name is also black. We evaluate the proposed method by cross-validation; about 91% of the domain names in the validation list are found within the top 1% of candidates.
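The co-occurrence idea above can be sketched in a few lines: score each unknown domain by how often it shares a query window with a known-black domain. This is a simplified stand-in for the paper's method, and all domain names are hypothetical:

```python
from collections import Counter

def cooccurrence_scores(query_windows, blacklist):
    """Score unknown domains by the fraction of their query windows that
    also contain a known-black domain (a simplified co-occurrence measure)."""
    co = Counter()    # unknown domain -> windows shared with a black domain
    seen = Counter()  # unknown domain -> windows it appears in at all
    for window in query_windows:
        names = set(window)
        has_black = bool(names & blacklist)
        for d in names - blacklist:
            seen[d] += 1
            if has_black:
                co[d] += 1
    return {d: co[d] / seen[d] for d in seen}

# Each set is one client's queries within a short time window.
windows = [
    {"known-bad.example", "c2-new.example", "cdn.example"},
    {"known-bad.example", "c2-new.example"},
    {"cdn.example", "news.example"},
]
scores = cooccurrence_scores(windows, {"known-bad.example"})
# c2-new.example scores 1.0; cdn.example 0.5; news.example 0.0
```

High-scoring domains become candidates for blacklist extension; thresholding the score is what produces the "top 1%" candidate list the evaluation refers to.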
Conference Paper
In this paper we explore the potential of leveraging properties inherent to domain registrations and their appearance in DNS zone files to predict the malicious use of domains proactively, using only minimal observation of known-bad domains to drive our inference. Our analysis demonstrates that our inference procedure derives on average 3.5 to 15 new domains from a given known-bad domain. 93% of these inferred domains subsequently appear suspect (based on third-party assessments), and nearly 73% eventually appear on blacklists themselves. For these latter, proactively blocking based on our predictions provides a median headstart of about 2 days versus using a reactive blacklist, though this gain varies widely for different domains.
Conference Paper
Phishing has been an easy and effective way to practice trickery and deception on the Internet. While solutions such as URL blacklisting have been effective to some degree, their reliance on exact matches with blacklisted entries makes them easy for attackers to evade. We start with the observation that attackers often employ simple modifications (e.g., changing the top-level domain) to URLs. Our system, PhishNet, exploits this observation using two components. In the first component, we propose five heuristics to enumerate simple combinations of known phishing sites to discover new phishing URLs. The second component consists of an approximate matching algorithm that dissects a URL into multiple components that are matched individually against entries in the blacklist. In our evaluation with real-time blacklist feeds, we discovered around 18,000 new phishing URLs from a set of 6,000 new blacklist entries. We also show that our approximate matching algorithm leads to very few false positives (3%) and false negatives (5%).
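Both components lend themselves to short sketches: one of PhishNet's five heuristics (TLD swapping) for enumerating candidates, and a component-wise approximate match instead of an exact string match. The TLD list, threshold, and URLs below are illustrative assumptions, not the paper's parameters:

```python
from urllib.parse import urlparse

TLDS = ["com", "net", "org", "info"]  # illustrative TLD-swap heuristic

def tld_variants(url):
    """Enumerate candidate phishing URLs by swapping the top-level domain."""
    p = urlparse(url)
    host, _, tld = p.netloc.rpartition(".")
    return [p._replace(netloc=f"{host}.{t}").geturl() for t in TLDS if t != tld]

def approx_match(url, blacklist, threshold=2):
    """Approximate blacklist match: count matching URL components
    (hostname and path segments) instead of requiring an exact match."""
    def parts(u):
        p = urlparse(u)
        return {p.netloc} | {seg for seg in p.path.split("/") if seg}
    target = parts(url)
    return any(len(target & parts(b)) >= threshold for b in blacklist)

variants = tld_variants("http://paypal-login.example.com/verify")  # 3 variants
bl = {"http://evil.example.net/verify/account"}
hit = approx_match("http://evil.example.net/verify", bl)    # shares host + 'verify'
miss = approx_match("http://benign.example.org/home", bl)   # shares nothing
```

The point of component-wise matching is exactly the evasion case above: a trivially shortened or extended path still shares enough components with the blacklisted entry to be caught.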
Conference Paper
Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
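A feel for the "lexical features" such classifiers consume can be had from a few lines of extraction code. This sketch shows typical features (URL length, digit count, IP-address hosts, suspicious tokens); the exact feature set and the learned model are the paper's, not reproduced here, and the token list is an assumption:

```python
import re
from urllib.parse import urlparse

def lexical_features(url):
    """A handful of the lexical features URL classifiers typically extract."""
    p = urlparse(url)
    return {
        "url_len": len(url),
        "num_dots": p.netloc.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        # Raw IP in place of a hostname is a classic phishing indicator.
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", p.netloc)),
        "suspicious_token": any(t in url.lower()
                                for t in ("login", "verify", "secure", "account")),
    }

feats = lexical_features("http://192.0.2.7/paypal/login.php?id=42")
```

In the full system, thousands of such features (including a bag-of-words over tokens) feed a linear classifier; the features above are merely the intuition behind why lexical structure alone is so predictive.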
Conference Paper
We collected DNS responses at the University of Auckland Internet gateway in an SQL database, and analyzed them to detect unusual behaviour. Our DNS response data have included typosquatter domains, fast-flux domains and domains being (ab)used by spammers. We observe that current attempts to reduce spam have greatly increased the number of A records being resolved. We also observe that the data locality of DNS requests diminishes because of domains advertised in spam.
Conference Paper
An increasingly popular technique for decreasing user-perceived latency while browsing the Web is to optimistically pre-resolve (or prefetch) domain name resolutions. In this paper, we present a large-scale evaluation of this practice using data collected over the span of several months, and show that it leads to noticeable increases in load on name servers, with questionable caching benefits. Furthermore, to assess the impact that prefetching can have on the deployment of security extensions to DNS (DNSSEC), we use a custom-built cache simulator to perform trace-based simulations using millions of DNS requests and responses collected campus-wide. We also show that the adoption of domain name prefetching raises privacy issues. Specifically, we examine how prefetching amplifies information disclosure attacks to the point where it is possible to infer the context of searches issued by clients.
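The trace-based simulation methodology is simple at its core: replay timestamped lookups against a TTL-bounded cache and count hits versus misses. A minimal sketch (the class name, single global TTL, and toy trace are assumptions; the paper's simulator honors per-record TTLs from real responses):

```python
class DNSCacheSim:
    """Trace-driven TTL cache: replay (time, name) lookups and count hits,
    the kind of simulation used to gauge prefetching's caching benefit."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}   # name -> time at which the cached entry expires
        self.hits = self.misses = 0

    def lookup(self, t, name):
        if self.expiry.get(name, -1) > t:
            self.hits += 1
        else:
            self.misses += 1            # cold or expired: (re)fetch and cache
            self.expiry[name] = t + self.ttl

sim = DNSCacheSim(ttl=60)
trace = [(0, "a.example"), (30, "a.example"),
         (120, "a.example"), (130, "b.example")]
for t, name in trace:
    sim.lookup(t, name)
# hits == 1 (the t=30 repeat), misses == 3
```

Running the same trace with and without synthetic prefetch lookups injected is how one quantifies the extra resolver load against any hit-rate improvement.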
Conference Paper
The Domain Name System (DNS) is one of the most widely used services on the Internet. In this paper, we consider the question of how DNS traffic monitoring can provide an important and useful perspective on network traffic in an enterprise. We approach this problem by considering three classes of DNS traffic: canonical (i.e., RFC-intended behaviors), overloaded (e.g., blacklist services), and unwanted (i.e., queries that will never succeed). We describe a context-aware clustering methodology that is applied to DNS query-responses to generate the desired aggregates. Our method enables the analysis to be scaled to expose the desired level of detail of each traffic type, and to expose their time-varying characteristics. We implement our method in a tool we call TreeTop, which can be used to analyze and visualize DNS traffic in real-time. We demonstrate the capabilities of our methodology and the utility of TreeTop using a set of DNS traces that we collected from our campus network over a period of three months. Our evaluation highlights both the coarse and fine level of detail that can be revealed by our method. Finally, we show preliminary results on how DNS analysis can be coupled with general network traffic monitoring to provide a useful perspective for network management and operations.
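The aggregation step underlying this kind of clustering can be illustrated by rolling individual query names up the DNS hierarchy. This sketch aggregates by the last N labels only; TreeTop's actual method is context-aware and adapts the aggregation depth per subtree, which this deliberately omits:

```python
from collections import Counter

def suffix_aggregate(queries, depth=2):
    """Aggregate query counts by domain suffix (the last `depth` labels),
    a much-simplified form of hierarchical DNS traffic aggregation."""
    agg = Counter()
    for q in queries:
        labels = q.rstrip(".").split(".")
        agg[".".join(labels[-depth:])] += 1
    return agg

qs = ["a.cdn.example.com", "b.cdn.example.com", "mail.example.org",
      "x.tracker.example.com"]
agg = suffix_aggregate(qs)  # example.com -> 3, example.org -> 1
```

Varying `depth` (or, as in the paper, choosing it adaptively per subtree) is what lets an operator zoom between a coarse per-zone view and fine per-host detail.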
M. Cotton and L. Vegoda. Special Use IPv4 Addresses. RFC 5735 (Best Current Practice); obsoleted by RFC 6890, updated by RFC 6598.
Blue Coat Systems. Snake in the grass: Python-based malware used for targeted attacks.
J. Weil, V. Kuarsingh, C. Donley, C. Liljenstolpe, and M. Azinger. IANA-Reserved IPv4 Prefix for Shared Address Space.
K. Ishibashi, T. Toyono, H. Hasegawa, and H. Yoshino. Extending black domain name list by using co-occurrence relation between DNS queries.
Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. de Groot, and E. Lear. Address Allocation for Private Internets. RFC 1918 (Best Current Practice); updated by RFC 6761.
  • L Daigle