Conference Paper

Enabling Network Security Through Active DNS Datasets


Abstract

Most modern cyber crime leverages the Domain Name System (DNS) to attain high levels of network agility and make detection of Internet abuse challenging. The majority of malware, which represent a key component of illicit Internet operations, are programmed to locate the IP address of their command-and-control (C&C) server through DNS lookups. To make the malicious infrastructure both agile and resilient, malware authors often use sophisticated communication methods that utilize DNS (i.e., domain generation algorithms) for their campaigns. In general, Internet miscreants make extensive use of short-lived disposable domains to promote a large variety of threats and support their criminal network operations. To effectively combat Internet abuse, the security community needs access to freely available and open datasets. Such datasets will enable the development of new algorithms for the early detection and tracking of modern Internet threats across their entire lifetime. To that end, we have created a system, Thales, that actively queries and collects records for massive amounts of domain names from various seeds. These seeds are collected from multiple public sources and are, therefore, free of privacy concerns. The results of this effort will be made freely available to the research community. With three case studies we demonstrate the detection merit that the collected active DNS datasets contain. We show that (i) more than 75% of the domain names in public blacklists (PBLs) appear in our datasets several weeks (and in some cases months) in advance, (ii) existing DNS research can be implemented using only active DNS, and (iii) malicious campaigns can be identified with the signal provided by active DNS.
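As a rough illustration of the collection loop the abstract describes (resolve every domain in a seed list and store the returned records), the sketch below uses the dnspython library; it is not the authors' Thales implementation, and the seed file name, output file, and record-type list are assumptions made for this example.

```python
# Minimal sketch of an active DNS collector in the spirit of the paper:
# resolve every domain in a seed list and store the returned records.
# Not the authors' Thales system; seed/output file names and record types are assumed.
import json
import time
import dns.exception
import dns.resolver  # pip install dnspython

RECORD_TYPES = ["A", "AAAA", "NS", "MX", "CNAME"]  # assumed subset

def resolve_domain(domain, resolver):
    """Query several record types for one domain and return the answers."""
    records = []
    for rtype in RECORD_TYPES:
        try:
            answer = resolver.resolve(domain, rtype)
            for rr in answer:
                records.append({
                    "qname": domain,
                    "type": rtype,
                    "rdata": rr.to_text(),
                    "ttl": answer.rrset.ttl,
                    "ts": int(time.time()),
                })
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN,
                dns.resolver.NoNameservers, dns.exception.Timeout):
            continue  # missing record types and timeouts are simply skipped
    return records

if __name__ == "__main__":
    resolver = dns.resolver.Resolver()
    resolver.lifetime = 3.0
    with open("seed_domains.txt") as f, open("active_dns.jsonl", "w") as out:
        for line in f:
            for rec in resolve_domain(line.strip(), resolver):
                out.write(json.dumps(rec) + "\n")
```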


... In exploring the DNS research methods, we have investigated various data analysis methods, which have been utilized to detect, model, and mitigate the aforementioned DNS threats. First, we describe two main DNS data collection methods utilized in the literature and the associated works, including passive DNS data (PDNS) [80,47,81,82,83,12,11,84,48,41,67,22,51] and Active DNS data (ADNS) [67,83,85,54,41,86]. Next, we categorize the research works based on the common DNS data analysis techniques that have been used in the literature, such as machine learning algorithms [80,17,46,53,22,54,55,56,87,88,89,76] and association analysis [59,82,90,54,91]. ...
... The high-level architecture of passive DNS measurement systems is shown in Figure 11. Because of the valuable information they collect, passive DNS databases have been considered an invaluable asset for cybersecurity researchers in combating a wide range of threats such as malware, botnets, and malicious actors [80,47,81,82,83,12,11,84,48,41,67,22,51]. ...
Preprint
Full-text available
The domain name system (DNS) is one of the most important components of today's Internet, and is the standard naming convention between human-readable domain names and machine-routable IP addresses of Internet resources. However, due to the vulnerability of DNS to various threats, its security and functionality have been continuously challenged over the course of time. Although researchers have addressed various aspects of DNS in the literature, there are still many challenges yet to be addressed. In order to comprehensively understand the root causes of the vulnerabilities of DNS, it is necessary to review the various activities of the research community on the DNS landscape. To this end, this paper surveys more than 170 peer-reviewed papers, published in top conferences and journals in the last ten years, and summarizes vulnerabilities in DNS and corresponding countermeasures. This paper not only focuses on the DNS threat landscape and existing challenges, but also discusses the data analysis methods that are frequently used to address DNS threats and vulnerabilities. Furthermore, we look into the DNS threat landscape from the viewpoint of the entities involved in the DNS infrastructure in an attempt to point out the more vulnerable entities in the system.
... Using passive measurement, DNS data is obtained by an entity who is in a position to capture DNS traffic from the network infrastructure under its control (e.g., networks of academic institutes or small organizations) [43]. Several previous studies use passive measurement to observe DNS traffic [5,8,20,37,43]. Passive measurements, however, may introduce bias in the data collected depending on the time, location, and demographics of users within the monitored network. Moreover, another issue with passive data collection is ethics, as data gathered over a long period of time can reveal online habits of monitored users. ...
... Researchers can choose which domains to resolve depending on the goals of their study, thus having more control over the collected data. Although this approach can remedy the privacy issue of passive DNS measurement, it requires an increased amount of resources for running a dedicated measurement infrastructure if there is a large number of domains that need to be resolved [20]. There are prior works that have been conducting large-scale active DNS measurements for different purposes and provide their datasets to the community [20,33]. ...
Article
Full-text available
Understanding web co-location is essential for various reasons. For instance, it can help one to assess the collateral damage that denial-of-service attacks or IP-based blocking can cause to the availability of co-located web sites. However, it has been more than a decade since the first study was conducted in 2007. The Internet infrastructure has changed drastically since then, necessitating a renewed study to comprehend the nature of web co-location. In this paper, we conduct an empirical study to revisit web co-location using datasets collected from active DNS measurements. Our results show that the web is still small and centralized to a handful of hosting providers. More specifically, we find that more than 60% of web sites are co-located with at least ten other web sites---a group comprising less popular web sites. In contrast, 17.5% of mostly popular web sites are served from their own servers. Although a high degree of web co-location could make co-hosted sites vulnerable to DoS attacks, our findings show that it is an increasing trend to co-host many web sites and serve them from well-provisioned content delivery networks (CDN) of major providers that provide advanced DoS protection benefits. Regardless of the high degree of web co-location, our analyses of popular block lists indicate that IP-based blocking does not cause severe collateral damage as previously thought.
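The co-location measurement described above essentially reduces to grouping domains by the IP addresses they resolve to and counting how many sites share each address. A minimal sketch of that computation follows; the CSV input format and the "at least ten co-hosted sites" threshold are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch: compute web co-location degree from an active DNS snapshot.
# Input format (domain,ip CSV) and the ">= 10 co-hosted sites" bucket are
# illustrative assumptions, not the paper's pipeline.
import csv
from collections import defaultdict

def co_location_degrees(path):
    ip_to_domains = defaultdict(set)
    with open(path) as f:
        for domain, ip in csv.reader(f):
            ip_to_domains[ip].add(domain)
    # a domain's co-location degree = number of other domains on any of its IPs
    degree = defaultdict(int)
    for ip, domains in ip_to_domains.items():
        for d in domains:
            degree[d] = max(degree[d], len(domains) - 1)
    return degree

if __name__ == "__main__":
    deg = co_location_degrees("a_records.csv")
    co_hosted = sum(1 for d, k in deg.items() if k >= 10)
    print(f"{co_hosted / len(deg):.1%} of sites share an IP with >= 10 others")
```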
... For years, TLD zone files have been a popular tool in the security community to find active domain names [12,30,34,36,37,43]. Zone files are publicly available to researchers who request access and they represent a snapshot of resolvable domains, but not all registered domains will appear in zone files. ...
... Because of this behavior, we conclude that zone files alone cannot be used to indicate whether a domain is registered or not. Therefore, systems that rely on zone files [12,30,34,36,37,43] are bound to miss domains that are temporarily suspended, as well as domains that are registered but absent from the zone files. ...
... Another lesson from this study is that registration data including status, registrar, and dates should be maintained as a publicly available resource. Public access to zone files has been very successful in aiding security research and applications [12,30,34,36,37,43], but it is not enough to identify all registered domain names, nor does it cover all stages of the domain life-cycle, making cases like early deletion and drop-catching more difficult to monitor. In the past, query limits on WHOIS were a reasonable precaution to prevent mass collection of registrants' personal information, but with recent changes to WHOIS privacy, largely driven by the EU's GDPR, this is no longer necessary. ...
Conference Paper
Full-text available
Domain names are a valuable resource on the web. Most domains are available to the public on a first-come, first-served basis, and once domains are purchased, the owners keep them for a period of at least one year before deciding whether to renew them. Common wisdom suggests that even if a domain name stops being useful to its owner, the owner will merely wait until the domain organically expires and choose not to renew. In this paper, contrary to common wisdom, we report on the discovery that domain names are often deleted before their expiration date. This is concerning because this practice offers no advantage for legitimate users, while malicious actors deleting domains may hamper forensic analysis of malicious campaigns, and registrars deleting domains instead of suspending them enable re-registration and continued abuse. Specifically, we present the first systematic analysis of early domain name disappearances from the largest top-level domains (TLDs). We find more than 386,000 cases where domain names were deleted before expiring and we discover individuals with more than 1,000 domains deleted in a single day. Moreover, we identify the specific registrars that choose to delete domain names instead of suspending them. We compare lexical features of these domains, finding significant differences between domains that are deleted early, suspended, and organically expiring. Furthermore, we explore potential reasons for deletion, finding over 7,000 domain names squatting on more popular domains and more than 14,000 associated with malicious registrants.
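One way to operationalize the measurement described in this abstract is to diff consecutive zone-file snapshots and flag domains that disappear well before their expiration date. The sketch below assumes plain-text snapshots (one domain per line) and a hypothetical get_expiration() lookup (e.g., via WHOIS/RDAP); it is an illustration, not the authors' pipeline.

```python
# Sketch: flag domains that vanish from a TLD zone long before expiration.
# Snapshot format (one domain per line) and get_expiration() are assumptions.
from datetime import date, timedelta

def load_zone(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def early_deletions(zone_yesterday, zone_today, get_expiration, today=None,
                    margin_days=30):
    """Domains present yesterday, absent today, yet expiring far in the future."""
    today = today or date.today()
    flagged = []
    for domain in zone_yesterday - zone_today:
        expires = get_expiration(domain)  # hypothetical WHOIS/RDAP lookup
        if expires and expires - today > timedelta(days=margin_days):
            flagged.append((domain, expires))
    return flagged

# Example usage (with a stub expiration lookup):
# flagged = early_deletions(load_zone("zone_day1.txt"), load_zone("zone_day2.txt"),
#                           get_expiration=lambda d: date(2024, 12, 31))
```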
... Many previous studies use passive measurement to observe DNS traffic on their networks [14,31,61,97,108]. However, passive data collection can suffer from bias depending on the time, location, and demographics of users within the observed network. ...
... There are also prior works (by both academia and industry) that conducted large-scale active DNS measurements for several purposes and made their datasets available to the community [61,90,104]. However, all of these datasets have two common issues that make them unsuitable to be used directly in our study. ...
... In addition, we also analyze reverse DNS data to examine the current adoption status of reverse mapping from IP to domain name, and compare it to our number of single-hosted domains found in §5.1, as this is another potential way for an adversary to infer a visited domain. The Active DNS Project [61] is currently collecting A records of about 300M domains derived from 1.3K TLD zone files on a daily basis. In addition to this effort, Rapid7 also conducts active DNS measurements at a large scale and offers researchers access to its data [90]. ...
Preprint
Full-text available
As Internet users have become more savvy about the potential for their Internet communication to be observed, the use of network traffic encryption technologies (e.g., HTTPS/TLS) is on the rise. However, even when encryption is enabled, users leak information about the domains they visit via their DNS queries and via the Server Name Indication (SNI) extension of TLS. Two proposals to ameliorate this issue are DNS over HTTPS/TLS (DoH/DoT) and Encrypted SNI (ESNI). In this paper we aim to assess the privacy benefits of these proposals by considering the relationship between hostnames and IP addresses, the latter of which are still exposed. We perform DNS queries from nine vantage points around the globe to characterize this relationship. We quantify the privacy gain due to ESNI for different hosting and CDN providers using two different metrics, the k-anonymity degree due to co-hosting and the dynamics of IP address changes. We find that 20% of the domains studied will not gain any privacy benefit since they have a one-to-one mapping between their hostname and IP address. Our results show that 30% will gain a high level of privacy benefit with a k value greater than 100, meaning that an adversary can correctly guess these domains with a probability less than 1%. Domains whose visitors will gain a high privacy level are far less popular, while visitors of popular sites will gain much less. Analyzing the dynamics of IP addresses of long-lived domains, we find that only 7.7% of them change their hosting IP addresses on a daily basis. We conclude by discussing potential approaches for website owners and hosting/CDN providers for maximizing the privacy benefits of ESNI.
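The k-anonymity metric used in this abstract can be computed directly from hostname-to-IP mappings: k is the number of hostnames sharing a given IP address, and 1/k bounds an observer's chance of guessing the visited domain from the IP alone. A small sketch, with the input mapping assumed:

```python
# Sketch: k-anonymity of a domain due to co-hosting, following the idea in the
# abstract above (k = number of hostnames behind the same IP; guess prob = 1/k).
# The mapping below is an assumed toy input.
from collections import defaultdict

def k_anonymity(domain_to_ips):
    ip_to_domains = defaultdict(set)
    for domain, ips in domain_to_ips.items():
        for ip in ips:
            ip_to_domains[ip].add(domain)
    # a domain's k is the largest anonymity set among the IPs it is hosted on
    return {d: max(len(ip_to_domains[ip]) for ip in ips)
            for d, ips in domain_to_ips.items()}

mapping = {
    "example.com": {"192.0.2.10"},          # one-to-one mapping: k = 1, no benefit
    "blog.example.net": {"203.0.113.7"},
    "shop.example.org": {"203.0.113.7"},    # co-hosted: k = 2, guess prob 1/2
}
for domain, k in k_anonymity(mapping).items():
    print(f"{domain}: k={k}, guess probability <= {1.0 / k:.2f}")
```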
... Active DNS data do not have privacy problems because they do not include information about users' queries. Thales [14] is an example of a privacy-preserving active DNS data collection system that actively queries and collects a large volume of active DNS data using domain names from various publicly accessible sources. Passive DNS data provide historical records of domains and contain richer information than active DNS data. ...
... Passive DNS data are gathered by deploying sensors on multiple DNS servers and from DNS server logs to obtain real DNS queries and response information, but the collected data have certain limitations and privacy issues depending on the location of the deployed sensors, especially if sensors are deployed between clients and resolvers. The authors of Ref. [14] compared active and passive DNS data, showing that active DNS data contain more DNS record types while passive DNS data provide a tighter connection graph. Based on this [6], active DNS data can be used to discover newly created and potentially malicious domains. ...
Article
Different types of malicious attacks have been increasing simultaneously and have become a serious issue for cybersecurity. Most attacks leverage domain URLs as an attack communications medium and turn users into victims of phishing or spam. We take advantage of machine learning methods to detect the maliciousness of a domain automatically using three feature sets: DNS-based, lexical, and semantic features. The proposed approach exhibits high performance even with a small training dataset. The experimental results demonstrate that the proposed scheme achieves an approximate accuracy of 0.927 when using a random forest classifier.
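The classification step summarized above (DNS-based, lexical, and semantic features fed to a random forest) might look roughly like the following; the features shown are a simplified, assumed subset, and scikit-learn stands in for whatever toolkit the authors used.

```python
# Sketch: random-forest domain classifier on a few lexical features.
# The feature choice and toy labels are assumed, simplified stand-ins for the
# paper's DNS-based/lexical/semantic feature sets.
import math
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def entropy(s):
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

def lexical_features(domain):
    name = domain.split(".")[0]
    digits = sum(ch.isdigit() for ch in name)
    return [len(name), digits / max(len(name), 1), entropy(name),
            domain.count(".")]

# toy labeled data: 1 = malicious, 0 = benign (illustrative only)
domains = ["google.com", "wikipedia.org", "xj3k9fqz0a.biz", "paypa1-login.net"]
labels = [0, 0, 1, 1]
X = [lexical_features(d) for d in domains]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([lexical_features("q8z2kd0r.info")]))
```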
... DNS data is one of the most notable sources of information utilized to detect malicious domains [Kountouras et al. 2016;Weimer 2005]. In general, there are two types of approaches that complement each other. ...
... For example, Farsight passive DNS data [Farsight Security, Inc. 2019] utilizes sensors deployed behind DNS resolvers and provides aggregate information about domain resolutions. Active DNS data is collected by periodically querying a large pre-compiled list of domains in the Internet (e.g., Kountouras et al. [2016]). Passive DNS data has been an invaluable source of information for detecting and mitigating malicious activities in the Internet [Antonakakis et al. 2012;J. ...
Article
Full-text available
Malicious domains, including phishing websites, spam servers, and command and control servers, are the reason for many of the cyber attacks nowadays. Thus, detecting them in a timely manner is important to not only identify cyber attacks but also take preventive measures. There has been a plethora of techniques proposed to detect malicious domains by analyzing Domain Name System (DNS) traffic data. Traditionally, DNS acts as an Internet miscreant’s best friend, but we observe that the subtle traces in DNS logs left by such miscreants can be used against them to detect malicious domains. Our approach is to build a set of domain graphs by connecting “related” domains together and injecting known malicious and benign domains into these graphs so that we can make inferences about the other domains in the domain graphs. A key challenge in building these graphs is how to accurately identify related domains so that incorrect associations are minimized and the number of domains connected from the dataset is maximized. Based on our observations, we first train two classifiers and then devise a set of association rules that assist in linking domains together. We perform an in-depth empirical analysis of the graphs built using these association rules on passive DNS data and show that our techniques can detect many more malicious domains than the state-of-the-art.
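The graph inference described above (connect related domains, inject known malicious and benign labels, infer the rest) can be approximated with a simple iterative score propagation over a domain graph. The sketch below uses networkx; the edge construction and damping factor are assumptions, not the paper's exact algorithm.

```python
# Sketch: propagate maliciousness scores over a graph of related domains,
# in the spirit of the approach above. Edges (e.g., shared IPs) and the damping
# factor are assumptions; this is not the paper's exact inference algorithm.
import networkx as nx

def propagate(graph, seeds, iterations=10, damping=0.85):
    """seeds: dict domain -> 1.0 (known malicious) or 0.0 (known benign)."""
    score = {n: seeds.get(n, 0.5) for n in graph}  # 0.5 = unknown prior
    for _ in range(iterations):
        new = {}
        for n in graph:
            if n in seeds:                 # keep injected labels fixed
                new[n] = seeds[n]
                continue
            nbrs = list(graph.neighbors(n))
            if nbrs:
                avg = sum(score[m] for m in nbrs) / len(nbrs)
                new[n] = damping * avg + (1 - damping) * 0.5
            else:
                new[n] = score[n]
        score = new
    return score

g = nx.Graph()
g.add_edges_from([("bad1.example", "unknown.example"),
                  ("unknown.example", "good.example")])
print(propagate(g, {"bad1.example": 1.0, "good.example": 0.0}))
```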
... Active DNS data sets (collected by e.g., OpenINTEL [91]) rely on scanning zone files or popular domains to obtain these records, while passive DNS data sets (collected by e.g., Farsight Security [32]) extract them from monitored DNS responses. Both types of data sets have been used to detect malicious domain registrations and activity [19], [52], [84]. ...
... Passive DNS data collection may also have privacy implications [52], and requires sufficient storage and processing resources. Active DNS data collection has similar storage and resource needs, especially to ensure that records are updated sufficiently frequently. ...
... As an example, the connection between non-residential IPs and web services can be captured by the average number of TLD+3 domains per IP in the direct inetnum (§II). Intuitively, this feature describes the number of domains hosted in the direct inetnum of this IP, which were found in the Active DNS dataset [68]. Our evaluation on the labeled set shows that non-residential IPs have 5.49 as the average feature value while residential IPs only have 0.016. ...
... 68.92% are labeled as malware sites, 29.97% as malicious sites, and 2.24% as phishing sites). Examples include ntkrnlpa.cn, ...
Conference Paper
Full-text available
An emerging Internet business is residential proxy (RESIP) as a service, in which a provider utilizes the hosts within residential networks (in contrast to those running in a datacenter) to relay their customers' traffic, in an attempt to avoid server-side blocking and detection. Despite the prominent roles these services could play in the underground business world, little has been done to understand whether they are indeed involved in cybercrimes and how they operate, due to the challenges in identifying their RESIPs, not to mention any in-depth analysis of them. In this paper, we report the first study on RESIPs, which sheds light on the behaviors and the ecosystem of these elusive gray services. Our research employed an infiltration framework, including our clients for RESIP services and the servers they visited, to detect 6 million RESIP IPs across 230+ countries and 52K+ ISPs. The observed addresses were analyzed and the hosts behind them were further fingerprinted using a new profiling system. Our effort led to several findings about RESIP services unknown before. Surprisingly, despite the providers' claim that the proxy hosts join willingly, many proxies run on likely compromised hosts, including IoT devices. Through cross-matching the hosts we discovered with labeled PUP (potentially unwanted programs) logs provided by a leading IT company, we uncovered various illicit operations RESIP hosts performed, including illegal promotion, fast fluxing, phishing, malware hosting, and others. We also reverse engineered RESIP services' internal infrastructures and uncovered their potential rebranding and reselling behaviors. Our research takes the first step toward understanding this new Internet service, contributing to the effective control of their security risks.
... In particular, several defense mechanisms have been developed to filter out abnormal DNS traffic associated with notorious domains and directed toward malicious servers. Some solutions use reputation-based approaches (e.g., [6,22]), while others rely on the characteristics of benign domains and nameservers (e.g., [9,13,41]). Unfortunately, URs capitalizing on the reputation of popular domains and service providers can bypass such protections. Another type of defense mechanism (e.g., DNSSEC [31] and some advanced firewalls [14,52]) focuses on examining the DNS traffic following the normal resolution. ...
... A single domain can be assigned to multiple servers for web hosting services, which correspond to multiple IP addresses. We study a subset of DNS records, from April 26, 2018, as part of the ActiveDNS dataset (Kountouras et al. 2016). We explore the relationship between domains and IP addresses via an ActiveDNS hypergraph H 0 , where nodes are IP addresses and hyperedges are domains. ...
Article
Full-text available
Hypergraphs capture multi-way relationships in data, and they have consequently seen a number of applications in higher-order network analysis, computer vision, geometry processing, and machine learning. In this paper, we develop theoretical foundations for studying the space of hypergraphs using ingredients from optimal transport. By enriching a hypergraph with probability measures on its nodes and hyperedges, as well as relational information capturing local and global structures, we obtain a general and robust framework for studying the collection of all hypergraphs. First, we introduce a hypergraph distance based on the co-optimal transport framework of Redko et al. and study its theoretical properties. Second, we formalize common methods for transforming a hypergraph into a graph as maps between the space of hypergraphs and the space of graphs, and study their functorial properties and Lipschitz bounds. Finally, we demonstrate the versatility of our Hypergraph Co-Optimal Transport (HyperCOT) framework through various examples.
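The ActiveDNS hypergraph H0 mentioned in the excerpt above (IP addresses as nodes, domains as hyperedges) reduces to an incidence map from each domain to the set of IPs in its A records. A small sketch of building that structure, with the input format assumed:

```python
# Sketch: build the domain/IP hypergraph used in the excerpt above
# (nodes = IP addresses, hyperedges = domains). The input format is assumed.
from collections import defaultdict

def build_hypergraph(a_records):
    """a_records: iterable of (domain, ip) pairs from an active DNS snapshot."""
    hyperedges = defaultdict(set)   # domain -> set of IP nodes
    nodes = set()
    for domain, ip in a_records:
        hyperedges[domain].add(ip)
        nodes.add(ip)
    return nodes, dict(hyperedges)

records = [("example.com", "192.0.2.10"), ("example.com", "192.0.2.11"),
           ("example.org", "192.0.2.11")]
nodes, edges = build_hypergraph(records)
print(len(nodes), "IP nodes;", len(edges), "domain hyperedges")
```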
... This can be addressed by implementing an alternative DNS name server (or host-based firewall) combined with the creation of an exclusion and inclusion list of domains. However, until the domain or the corresponding IP address is blacklisted [26], the DNS connection will be successful in the cloud environment. In the aforementioned experiments, Iodine was used as the tunnelling application. ...
Article
Full-text available
The domain name system (DNS) protocol is fundamental to the operation of the internet; however, in recent years various methodologies have been developed that enable DNS attacks on organisations. In the last few years, the increased use of cloud services by organisations has created further security challenges as cyber criminals use numerous methodologies to exploit cloud services, configurations and the DNS protocol. In this paper, two different DNS tunnelling methods, Iodine and DNScat, were tested in the cloud environment (Google and AWS), and positive exfiltration results were achieved under different firewall configurations. Detection of malicious use of the DNS protocol can be a challenge for organisations with limited cybersecurity support and expertise. In this study, various DNS tunnelling detection techniques were utilised in a cloud environment to create an effective monitoring system with a reliable detection rate, low implementation cost, and ease of use for organisations with limited detection capabilities. The Elastic stack (an open-source framework) was used to configure a DNS monitoring system and to analyse the collected DNS logs. Furthermore, payload and traffic analysis techniques were implemented to identify different tunnelling methods. This cloud-based monitoring system offers various detection techniques that can be used for monitoring the DNS activities of any network and is especially accessible to small organisations. Moreover, the Elastic stack is open-source and has no limitation with regard to the data that can be uploaded daily.
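The payload-analysis techniques mentioned in this abstract typically score DNS query names for tunnelling indicators such as unusually long labels and high character entropy. A hedged sketch of such a heuristic follows; the thresholds are assumptions, not the paper's calibrated values.

```python
# Sketch: payload-based DNS tunnelling heuristic (long labels + high entropy),
# illustrating the kind of analysis described above. Thresholds are assumptions.
import math
from collections import Counter

def shannon_entropy(s):
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

def looks_like_tunnel(qname, max_label=40, min_entropy=3.8):
    labels = qname.rstrip(".").split(".")
    longest = max(len(l) for l in labels)
    subdomain = "".join(labels[:-2]) if len(labels) > 2 else labels[0]
    return longest > max_label or (
        len(subdomain) > 20 and shannon_entropy(subdomain) > min_entropy)

print(looks_like_tunnel("www.example.com"))                                  # False
print(looks_like_tunnel("aGVsbG8gd29ybGQgZnJvbSBpb2RpbmUx.t.example.com"))   # True
```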
... Another point of view held that a cryptography-based prefix-preserving anonymisation algorithm should be used to address this issue [Govil and Govil, 2007], or other encryption techniques that would secure the IP prefix [Xu et al., 2002]. Other researchers, trying to overcome this conflict, came up with a totally different solution: the collection of active DNS data [Kountouras et al., 2016]. Specifically, they created a system called Thales which can systematically query and collect large volumes of active DNS data, using as input an aggregation of publicly accessible sources of a vast number of domain names and URLs. ...
... Based on this intuition, a graph-based inference technique over associated domains (i.e., with common IPs) is constructed to enable the discovery of a large set of unknown malicious domains by using a small set of domains known to be malicious. ... Kountouras et al. (2016) took a different approach to facilitating the malicious domain detection effort by offering a large-scale and freely available means of aggregating active DNS data. The proposed system, referred to as Thales, accomplishes this aim by generating numerous DNS queries from a list of publicly accessible sources of collected domain names (e.g., the Alexa list, various TLD zone files, etc.). ...
Article
Full-text available
As the Internet has transformed into a critical infrastructure, society has become more vulnerable to its security flaws. Despite substantial efforts to address many of these vulnerabilities by industry, government, and academia, cyber security attacks continue to increase in intensity, diversity, and impact. Thus, it becomes intuitive to investigate the current cyber security threats, assess the extent to which corresponding defenses have been deployed, and evaluate the effectiveness of risk mitigation efforts. Addressing these issues in a sound manner requires large-scale empirical data to be collected and analyzed via numerous Internet measurement techniques. Although such measurements can generate comprehensive and reliable insights, doing so encompasses complex procedures involving the development of novel methodologies to ensure accuracy and completeness. Therefore, a systematic examination of recently developed Internet measurement approaches for cyber security must be conducted to enable thorough studies that employ several vantage points, correlate multiple data sources, and potentially leverage past successful techniques for more recent issues. Unfortunately, performing such an examination is challenging, as the literature is highly scattered. In large part, this is due to each research effort only focusing on a small portion of the many constituent parts of the Internet measurement domain. Moreover, to the best of our knowledge, no studies have offered an in-depth examination of this critical research domain in order to promote future advancements. To bridge these gaps, we explore all pertinent facets of utilizing Internet measurement techniques for cyber security, ranging from threats within specific application domains to threats themselves. We provide a taxonomy of cyber security-related Internet measurement studies across two dimensions. One dimension relates to the many vertical layers (and components) of the Internet ecosystem, while the other relates to internal normal functions vs. the negative impact of external parties in the Internet and physical world. A comprehensive comparison of the gathered studies is also offered in terms of measurement technique, scope, measurement size, vantage size, and the analysis approach that was leveraged. Finally, a discussion of the roadblocks to performing effective Internet measurements and possible future research directions is elaborated.
... More specifically, we collected approximately 209 million resource records (RRs), i.e., queried domain names and their associated RDATA, together with their lookup volumes aggregated on a daily basis. For our active DNS dataset, we obtained 290 million RRs per day from Thales, an active DNS monitoring system [44]. Both datasets cover the period of August 1, 2016 to February 28, 2017. ...
Article
Full-text available
The Mirai botnet, composed primarily of embedded and IoT devices, took the Internet by storm in late 2016 when it overwhelmed several high-profile targets with massive distributed denial-of-service (DDoS) attacks. In this paper, we provide a seven-month retrospective analysis of Mirai's growth to a peak of 600k infections and a history of its DDoS victims. By combining a variety of measurement perspectives, we analyze how the botnet emerged, what classes of devices were affected, and how Mirai variants evolved and competed for vulnerable hosts. Our measurements serve as a lens into the fragile ecosystem of IoT devices. We argue that Mirai may represent a sea change in the evolutionary development of botnets: the simplicity through which devices were infected and its precipitous growth demonstrate that novice malicious techniques can compromise enough low-end devices to threaten even some of the best-defended targets.
... Authors who choose to identify abused domains at this step [5] use lexical features derived from the domain and the registrant's data, which can prevent the abused domain from being used. After payment and the first update of the DNS zone, authors usually employ approaches based on active [6] and passive [7] DNS data, in the so-called post-registration stage. Passive DNS is the collection of communication between DNS servers, performed by sniffers installed in the network, from which it is possible to obtain queries and responses from DNS servers [8]. ...
Article
Full-text available
DNS is vital for the proper functioning of the Internet. However, some users exploit this structure by registering domains for abuse. These domains are used as tools to carry out a wide variety of attacks. Thus, early detection of abused domains prevents more people from falling for scams. In this work, an approach for identifying abused domains was developed using passive DNS collected from an authoritative TLD DNS server, along with data enriched through geolocation, thus enabling a global view of the domains. The system monitors the domain's first seven days of life after its first DNS query, during which two behavior checks are performed, the first after three days and the second after seven days. The generated models apply the machine learning algorithm LightGBM, and because of the unbalanced data, a combination of the Cluster Centroids and K-Means SMOTE techniques was used. As a result, the system obtained an average AUC of 0.9673 for the three-day model and an average AUC of 0.9674 for the seven-day model. Finally, validation of the three- and seven-day models in a test environment reached a TPR of 0.8656 and 0.8682, respectively. The system shows satisfactory performance for the early identification of abused domains, and the results highlight the importance of a TLD in identifying these domains.
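The rebalance-then-classify step described above can be sketched with imbalanced-learn and LightGBM as follows; note that plain SMOTE is substituted here for K-Means SMOTE to keep the toy example robust, and the synthetic feature matrix merely stands in for the paper's domain features.

```python
# Sketch of the rebalance-then-classify step: ClusterCentroids under-samples the
# majority class and SMOTE over-samples the minority class (the paper uses
# K-Means SMOTE; plain SMOTE is substituted for robustness on toy data), then a
# LightGBM classifier is trained on the rebalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

# toy imbalanced data standing in for the domain feature vectors
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = Pipeline([
    ("under", ClusterCentroids(sampling_strategy=0.5, random_state=0)),
    ("over", SMOTE(sampling_strategy=1.0, random_state=0)),
    ("clf", LGBMClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```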
... The Active DNS Project [225] is currently collecting A records of about 300M domains derived from 1.3K zone files on a daily basis. In addition to this effort, Rapid7 [12] also conducts active DNS measurements at a large scale and offers researchers access to its data. ...
Thesis
With the Internet having become an indispensable means of communication in modern society, censorship and surveillance in cyberspace are getting more prevalent. Malicious actors around the world, ranging from nation states to private organizations, are increasingly making use of technologies to not only control the free flow of information, but also eavesdrop on Internet users' online activities. Internet censorship and online surveillance have led to severe human rights violations, including the freedom of expression, the right to information, and privacy. In this dissertation, we present two related lines of research that seek to tackle the twin problems of Internet censorship and online surveillance via an empirical lens. We show that empirical network measurement, when conducted at scale and in a longitudinal manner, is an essential approach to gain insights into (1) censors' blocking behaviors and (2) key characteristics of anti-censorship and privacy-enhancing technologies. These insights can then be used to not only aid in the development of effective censorship circumvention tools, but also help related stakeholders making informed decisions to maximize the privacy benefits of privacy-enhancing technologies. With a focus on measuring Internet censorship, we first conduct an empirical study of the I2P anonymity network, shedding light on important properties of the network and its censorship resistance. By measuring the state of I2P censorship around the globe, we then expose numerous censorship regimes (e.g., China, Iran, Oman, Qatar, and Kuwait) where I2P are blocked by various techniques. As a result of this work, I2P has adopted DNS over HTTPS, which is one of the domain name encryption protocols introduced recently, to prevent passive snooping and make the bootstrapping process more resistant to DNS-based network filtering and surveillance. Of the censors discovered above, we find that China is the most sophisticated one, having developed an advanced network filtering system, known as the Great Firewall (GFW). Continuing the same line of work, we have developed GFWatch, a large-scale, longitudinal measurement platform capable of testing hundreds of millions of domains daily, enabling continuous monitoring of the DNS filtering behavior of China's GFW. Data collected by GFWatch does not only cast new light on technical observations, but also timely inform the public about changes in the GFW’s blocking policy and assist other detection and circumvention efforts. We then focus on measuring and improving the privacy benefits provided by domain name encryption technologies, such as DNS over TLS (DoT), DNS over HTTPS (DoH), and Encrypted Client Hello (ECH). Although the security benefits of these technologies are clear, their positive impact on user privacy is weakened by—the still exposed—IP address information. We assess the privacy benefits of these new technologies by considering the relationship between hostnames and their hosting IP addresses. We show that encryption alone is not enough to protect web users' privacy. Especially when it comes to preventing nosy network observers from tracking users' browsing activities, the IP address information of remote servers being contacted is still visible, which can then be employed to infer the visited websites. 
Our findings help raise awareness about the remaining effort that must be undertaken by related stakeholders (i.e., website owners and hosting providers) to ensure a meaningful privacy benefit from the universal deployment of domain name encryption technologies. Nevertheless, the benefits provided by DoT/DoH against threats "under the recursive resolver" come with the cost of trusting the DoT/DoH operator with the entire web browsing history of users. As a step towards mitigating the privacy concerns stemming from the exposure of all DNS resolutions of a user (effectively the user's entire domain-level browsing history) to an additional third-party entity, we proposed K-resolver, a resolution mechanism in which DNS queries are dispersed across multiple (K) DoH servers, allowing each of them to individually learn only a fraction (1/K) of a user's browsing history. Our experimental results show that our approach incurs negligible overhead while improving user privacy. Last, but not least, given that the visibility into plaintext domain information is lost due to the introduction of domain name encryption protocols, it is important to investigate whether and how network traffic of these protocols is interfered with by different Internet filtering systems. We created DNEye, a measurement system built on top of a network of distributed vantage points, which we used to study the accessibility of DoT/DoH and ESNI, and to investigate whether these protocols are tampered with by network providers (e.g., for censorship). We find evidence of blocking efforts against domain name encryption technologies in several countries, including China, Russia, and Saudi Arabia. On the bright side, we discover that domain name encryption can help with unblocking more than 55% and 95% of censored domains in China and other countries where DNS-based filtering is heavily employed.
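The K-resolver idea summarized in this thesis abstract (spread DNS queries over K DoH resolvers so that each observes only about 1/K of the browsing history) can be illustrated with a simple deterministic dispatcher; the resolver URLs below are placeholders and the hashing rule is an assumption, not the proposed mechanism's exact policy.

```python
# Sketch of K-resolver-style query dispersal: each domain is consistently mapped
# to one of K DoH resolvers, so each resolver observes roughly 1/K of the user's
# domains. Resolver URLs are placeholders; the hashing rule is assumed.
import hashlib

DOH_RESOLVERS = [                        # placeholder endpoints
    "https://doh1.example/dns-query",
    "https://doh2.example/dns-query",
    "https://doh3.example/dns-query",
]

def pick_resolver(domain, resolvers=DOH_RESOLVERS):
    """Deterministically assign a domain to one resolver (stable across queries)."""
    digest = hashlib.sha256(domain.lower().encode()).digest()
    return resolvers[int.from_bytes(digest[:4], "big") % len(resolvers)]

for d in ["example.com", "wikipedia.org", "github.com", "example.org"]:
    print(d, "->", pick_resolver(d))
```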
... Some authors choose to work on detecting these domains in the pre-registration stage, in which the lexical issues of the domain and information about the registrant are analyzed [6]. When the active [7] or passive [8] DNS is analyzed, the approach used is in the post-registration stage. ...
... Moreover, DNS data related to real Internet users are difficult and expensive to obtain, since they are limited in scope and time, and subject to laws and regulations against privacy violations. For these reasons researchers often prefer to resort to active DNS analysis, where the lists of domain names to be queried are defined by the analyst through a so-called domain seed, and the queries are performed algorithmically by a set of controlled computers [20]. Active DNS measurements rely heavily on the domain seed and on the number and distribution of the IP addresses where DNS responses are read: a scarce domain seed can limit the scope and breadth of the analysis, whilst poor diversity in the locations of the collection points might result in the missed detection of some of the IP addresses associated with an anomalous behaviour. ...
Article
When DNS was created, nobody expected that it would become the basis for the digital economy and a prime target for cybercriminals. And nobody expected that one main asset of the digital economy would be users' browsing habits, putting their privacy at risk. The DNS was designed and implemented according to speed, scalability, and reliability criteria, whereas security and privacy were not among the objectives. Although the first attacks were already conceived about thirty years ago, the DNS infrastructure (with a number of improvements, but essentially its original design) continues to play a pivotal role in enabling access to services, data and devices. And, despite the fairly widespread adoption of DNSSEC security extensions in recent years, DNS attacks are becoming more and more frequent, sophisticated and dangerous. They are global, varied, dynamic and can circumvent traditional security systems such as next-generation firewalls and data loss prevention systems. A revisitation of DNS assumptions has been proposed in very different ways, reflecting diverse points of view on Internet governance and user freedom, and a great effort is underway by standardization bodies, industry consortia and academic research to converge toward an updated design and implementation. The present work overviews the most promising proposals, trying to shed some light on the future of DNS.
... Several projects have released datasets as part of their results, and several studies have focused on the accumulation of datasets themselves. For example, Kountouras et al. [52] implemented a system called Thales, which collects DNS records for massive amounts of domain names distilled from multiple freely available sources, while Pearce et al. [53] developed a scalable, accurate, and ethical system, called Iris, which measures global name resolution and uses active manipulation to track the trends of domain names that evolve over time. In addition, Viglianisi et al. [54] referred to SysTaint, which facilitates reverse engineering of malware communications. ...
Article
Full-text available
Computer networks are facing serious threats from the emergence of malware with sophisticated DGAs (Domain Generation Algorithms). This type of DGA malware dynamically generates domain names by concatenating words from dictionaries for evading detection. In this paper, we propose an approach for identifying the callback communications of such dictionary-based DGA malware by analyzing their domain names at the word level. This approach is based on the following observations: These malware families use their own dictionaries and algorithms to generate domain names, and accordingly, the word usages of malware-generated domains are distinctly different from those of human-generated domains. Our evaluation indicates that the proposed approach is capable of achieving accuracy, recall, and precision as high as 0.9989, 0.9977, and 0.9869, respectively, when used with labeled datasets. We also clarify the functional differences between our approach and other published methods via qualitative comparisons. Taken together, these results suggest that malware-infected machines can be identified and removed from networks using DNS queries for detected malicious domain names as triggers. Our approach contributes to dramatically improving network security by providing a technique to address various types of malware encroachment.
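The word-level analysis described above can be approximated by segmenting a domain label into dictionary words and inspecting the resulting word usage; the tiny dictionary and greedy segmentation below are illustrative assumptions, not the authors' model.

```python
# Sketch: word-level segmentation of a domain label, in the spirit of the
# dictionary-DGA analysis above. The dictionary and scoring are toy assumptions.
def segment(label, dictionary, min_len=3):
    """Greedy left-to-right split of a label into known words (longest match first)."""
    words, i = [], 0
    while i < len(label):
        for j in range(len(label), i + min_len - 1, -1):
            if label[i:j] in dictionary:
                words.append(label[i:j])
                i = j
                break
        else:
            return None          # some part of the label is not a dictionary word
    return words

DICTIONARY = {"secure", "mail", "update", "online", "service", "account"}
for domain in ["secureupdate.com", "mailaccountonline.net", "xj3k9fqz0a.biz"]:
    label = domain.split(".")[0]
    print(domain, "->", segment(label, DICTIONARY))
```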
... The dataset for this exploration comes from ActiveDNS [61]. This project out of Georgia Tech does daily active DNS lookups for millions of IP addresses and records the query results in a database. ...
Preprint
We study hypergraph visualization via its topological simplification. We explore both vertex simplification and hyperedge simplification of hypergraphs using tools from topological data analysis. In particular, we transform a hypergraph to its graph representations known as the line graph and clique expansion. A topological simplification of such a graph representation induces a simplification of the hypergraph. In simplifying a hypergraph, we allow vertices to be combined if they belong to almost the same set of hyperedges, and hyperedges to be merged if they share almost the same set of vertices. Our proposed approaches are general, mathematically justifiable, and they put vertex simplification and hyperedge simplification in a unifying framework.
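The vertex-simplification rule described above (combine vertices that belong to almost the same set of hyperedges) can be expressed as a similarity test on vertex memberships; the Jaccard threshold in the sketch is an assumed stand-in for the paper's topological criterion.

```python
# Sketch: merge hypergraph vertices whose hyperedge memberships are almost
# identical, echoing the vertex simplification above. Jaccard similarity with a
# fixed threshold is an assumed stand-in for the paper's criterion.
from collections import defaultdict

def merge_similar_vertices(hyperedges, threshold=0.8):
    """hyperedges: dict edge_name -> set of vertices. Returns vertex -> group id."""
    membership = defaultdict(set)            # vertex -> set of hyperedges
    for e, verts in hyperedges.items():
        for v in verts:
            membership[v].add(e)
    group = {}
    representatives = []                     # (representative vertex, its membership)
    for v, edges in membership.items():
        for rep, rep_edges in representatives:
            jaccard = len(edges & rep_edges) / len(edges | rep_edges)
            if jaccard >= threshold:
                group[v] = group[rep]
                break
        else:
            group[v] = len(representatives)
            representatives.append((v, edges))
    return group

H = {"e1": {"a", "b", "c"}, "e2": {"a", "b"}, "e3": {"c", "d"}}
print(merge_similar_vertices(H))   # a and b share memberships {e1, e2} -> same group
```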
... Active DNS Dataset: We need to obtain a large amount of historical data combining domain names and their NS records. In general, such data can be collected using two approaches, passive DNS or active DNS [8]. Passive DNS is an approach to the passive collection of DNS queries and responses collected over a large network. ...
... who accesses the parked domain name visits the website and can see the advertisements and click on them. ...
Article
Full-text available
On the Internet, there are lots of unused domain names that are not used for any actual services. Domain parking is a monetization mechanism for displaying online advertisements in such unused domain names. Some domain names used in cyber attacks are known to leverage domain parking services after the attack. However, the temporal relationships between domain parking services and malicious domain names have not been studied well. In this study, we investigated how malicious domain names using domain parking services change over time. We conducted a large-scale measurement study of more than 66.8 million domain names that have used domain parking services in the past 19 months. We reveal the existence of 3,964 domain names that have been malicious after using domain parking. We further identify what types of malicious activities (e.g., phishing and malware) such malicious domain names tend to be used for. We also reveal the existence of 3.02 million domain names that utilized multiple parking services simultaneously or while switching between them. Our study can contribute to the efficient analysis of malicious domain names using domain parking services.
... Several studies have focused specifically on collecting DNS datasets that alleviate the bias caused by various factors. For example, Kountouras et al. [36] implemented a system called Thales, which collects DNS records for massive amounts of domain names distilled from multiple freely available sources. Pearce et al. [37] developed a scalable, accurate, and ethical system, called Iris, that measures global name resolution and uses active manipulation to track the trends of domain names that evolve over time. ...
Article
Full-text available
Some of the most serious security threats facing computer networks involve malware. To prevent malware-related damage, administrators must swiftly identify and remove the infected machines that may reside in their networks. However, many malware families have domain generation algorithms (DGAs) to avoid detection. A DGA is a technique in which the domain name is changed frequently to hide the callback communication from the infected machine to the command-and-control server. In this article, we propose an approach for estimating the randomness of domain names by superficially analyzing their character strings. This approach is based on the following observations: human-generated benign domain names tend to reflect the intent of their domain registrants, such as an organization, product, or content. In contrast, dynamically generated malicious domain names consist of meaningless character strings because conflicts with already registered domain names must be avoided; hence, there are discernible differences in the strings of dynamically generated and human-generated domain names. Notably, our approach does not require any prior knowledge about DGAs. Our evaluation indicates that the proposed approach is capable of achieving recall and precision as high as 0.9960 and 0.9029, respectively, when used with labeled datasets. Additionally, this approach has proven to be highly effective for datasets collected via a campus network. Thus, these results suggest that malware-infected machines can be swiftly identified and removed from networks using DNS queries for detected malicious domains as triggers.
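The string-level randomness estimation described above can be approximated by scoring a domain label against character-bigram statistics learned from benign names; the training list, smoothing, and scoring rule below are toy assumptions rather than the paper's model.

```python
# Sketch: score the "randomness" of a domain label via the average negative
# log-probability of its character bigrams under a model trained on benign
# labels, echoing the string-level approach above. Toy training data.
import math
from collections import Counter

BENIGN = ["google", "wikipedia", "amazon", "youtube", "facebook", "netflix",
          "microsoft", "github", "twitter", "instagram"]

def train_bigrams(labels):
    counts = Counter()
    for label in labels:
        padded = f"^{label}$"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts, sum(counts.values())

def randomness(label, counts, total):
    padded = f"^{label}$"
    logp = 0.0
    for i in range(len(padded) - 1):
        # add-one smoothing over a rough bigram vocabulary size
        p = (counts[padded[i:i + 2]] + 1) / (total + 27 * 27)
        logp += math.log2(p)
    return -logp / (len(padded) - 1)      # higher score = more random-looking

counts, total = train_bigrams(BENIGN)
for label in ["weather", "xj3k9fqz0a"]:
    print(label, round(randomness(label, counts, total), 2))
```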
... Another approach argued that a Cryptography-based Prefix preserving Anonymization algorithm [37] or other encryption techniques that would secure the IP prefix [38] should be employed. At the other end of the spectrum, an entirely different solution was proposed: the collection of active DNS data [39]. This was made possible by creating a system called Thales which can systematically query and collect large volumes of active DNS data using as input an aggregation of publicly accessible sources of domain names and URLs that have been collected for several years by the research team. ...
Article
Full-text available
The Domain Name System (DNS) was created to resolve the IP addresses of web servers to easily remembered names. When it was initially created, security was not a major concern; nowadays, this lack of inherent security and trust has exposed the global DNS infrastructure to malicious actors. The passive DNS data collection process creates a database containing various DNS data elements, some of which are personal and need to be protected to preserve the privacy of the end users. To this end, we propose the use of distributed ledger technology. We use Hyperledger Fabric to create a permissioned blockchain, which only authorized entities can access. The proposed solution supports queries for storing and retrieving data from the blockchain ledger, allowing the use of the passive DNS database for further analysis, e.g., for the identification of malicious domain names. Additionally, it effectively protects the DNS personal data from unauthorized entities, including the administrators that can act as potential malicious insiders, and allows only the data owners to perform queries over these data. We evaluated our proposed solution by creating a proof-of-concept experimental setup that passively collects DNS data from a network and then uses the distributed ledger technology to store the data in an immutable ledger, thus providing a full historical overview of all the records.
... Passive DNS data provides a summarized view of domain queries. Experiments have shown that active DNS data provides more kinds of records, and passive DNS data provides a tighter connection graph [22]. Passive DNS can provide richer information than active DNS, but due to privacy issues and the location of the deployed sensors, the collected data has certain limitations. ...
Chapter
The Domain Name System (DNS), as the foundation of the Internet, has been widely used by cybercriminals. Many malicious domain detection methods have achieved significant success in the past decades. However, existing detection methods usually use classification-based and association-based representations, which are not capable of dealing with the class imbalance between malicious and benign domains. To solve this problem, we propose a novel domain detection system named KSDom. KSDom uses a data collector to gather a large amount of DNS traffic data and rich external DNS-related data, then employs K-means and the SMOTE method to handle the imbalanced data. Finally, KSDom uses the Categorical Boosting (CatBoost) algorithm to identify malicious domains. Comprehensive experimental results clearly show the effectiveness of our KSDom system and prove its good robustness on imbalanced datasets with different ratios. KSDom maintains high accuracy even on extremely imbalanced DNS traffic.
... Several studies have focused on collecting DNS datasets. For example, Kountouras et al. [54] implemented a system, Thales, that collects DNS records for massive amounts of domain names distilled from multiple freely available sources. Pearce et al. [55] developed a scalable, accurate, and ethical system, Iris, that measures global name resolution with active manipulation for tracking the trends of domain names that evolve over time. ...
Article
Full-text available
Some of the most serious security threats facing computer networks involve malware. To prevent this threat, administrators need to swiftly remove the infected machines from their networks. One common way to detect infected machines in a network is by monitoring communications based on blacklists. However, detection using this method has the following two problems: no blacklist is completely reliable, and blacklists do not provide sufficient evidence to allow administrators to determine the validity and accuracy of the detection results. Therefore, simply matching communications with blacklist entries is insufficient, and administrators should pursue their detection causes by investigating the communications themselves. In this paper, we propose an approach for classifying malicious DNS queries detected through blacklists by their causes. This approach is motivated by the following observation: a malware communication is divided into several transactions, each of which generates queries related to the malware; thus, surrounding queries that occur before and after a malicious query detected through blacklists help in estimating the cause of the malicious query. Our cause-based classification drastically reduces the number of malicious queries to be investigated because the investigation scope is limited to only representative queries in the classification results. In experiments, we have confirmed that our approach could group 388 malicious queries into 3 clusters, each consisting of queries with a common cause. These results indicate that administrators can briefly pursue all the causes by investigating only representative queries of each cluster, and thereby swiftly address the problem of infected machines in the network.
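The cause-estimation idea described above starts from the queries observed shortly before and after each blacklist hit. The sketch below collects such context windows, which could then be grouped by similarity; the log format, window size, and blacklist are illustrative assumptions.

```python
# Sketch: gather the DNS queries surrounding each blacklist hit, the raw material
# for cause-based grouping as described above. Window size and inputs are assumed.
from datetime import datetime, timedelta

def surrounding_queries(log, blacklist, window_seconds=30):
    """log: list of (timestamp, qname) tuples sorted by time."""
    window = timedelta(seconds=window_seconds)
    contexts = []
    for ts, qname in log:
        if qname in blacklist:
            context = {q for t, q in log
                       if abs(t - ts) <= window and q != qname}
            contexts.append((qname, ts, context))
    return contexts

log = [
    (datetime(2024, 1, 1, 10, 0, 0), "cdn.example.net"),
    (datetime(2024, 1, 1, 10, 0, 5), "evil-tracker.example"),   # blacklisted
    (datetime(2024, 1, 1, 10, 0, 7), "ads.example.org"),
]
print(surrounding_queries(log, {"evil-tracker.example"}))
```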
... While there has been much work on general DNS measurements [30], [31], transparency [32], operation [33], and security [34], only a few studies consider the newly introduced gTLDs. The launch of new gTLDs has expanded the set of top-level domains used on the global Internet. ...
Article
Full-text available
The centralized zone data service (CZDS) was introduced by the Internet Corporation for Assigned Names and Numbers (ICANN) to facilitate sharing and access to zone data of the new generic Top-Level Domains (gTLDs). CZDS aims to improve the security and transparency of the naming system of the Internet. In this paper, we investigate CZDS’s transparency by measurement and evaluation. By requesting access to zone data of all gTLDs listed in the CZDS portal, we analyze various aspects of CZDS, including access status, responsiveness, and the reasons provided for granting or denying access. Among other findings, we find that while a large percentage of gTLD administrators respond within a reasonable time, more than 10% of them have a long request-to-decision waiting time, and some requests remain unanswered even six months after submission. Furthermore, we find denial cases with unjustified reasons, where administrators who denied requests asked for information that was already provided in the request form. We discuss the implications and how to achieve better outcomes for CZDS using insights from our measurement and evaluation.
... However, if this is not the case in the future, it would be easy to find alternatives. Given that the important features we extract depend more on the temporal registration information than the contact details of the registrant, we could replace our WHOIS features with DNS tracking systems like Active DNS [23] or the Alembic system [25]. ...
Conference Paper
Full-text available
Modern malware typically makes use of a domain generation algorithm (DGA) to avoid command and control domains or IPs being seized or sinkholed. This means that an infected system may attempt to access many domains in an attempt to contact the command and control server. Therefore, the automatic detection of DGA domains is an important task, both for the sake of blocking malicious domains and identifying compromised hosts. However, many DGAs use English wordlists to generate plausibly clean-looking domain names; this makes automatic detection difficult. In this work, we devise a notion of difficulty for DGA families called the smashword score; this measures how much a DGA family looks like English words. We find that this measure accurately reflects how much a DGA family's domains look like they are made from natural English words. We then describe our new modeling approach, which is a combination of a novel recurrent neural network architecture with domain registration side information. Our experiments show the model is capable of effectively identifying domains generated by difficult DGA families. Our experiments also show that our model outperforms existing approaches, and is able to reliably detect difficult DGA families such as matsnu, suppobox, rovnix, and others. The model's performance compared to the state of the art is best for DGA families that resemble English words. We believe that this model could either be used in a standalone DGA domain detector---such as an endpoint security application---or alternately the model could be used as a part of a larger malware detection system.
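The smashword score itself is defined in the paper; the sketch below is only a loosely related, hypothetical heuristic in the same spirit. It measures what fraction of a second-level label can be covered greedily by dictionary words, so wordlist-based DGA output scores high while random-looking labels score low. The tiny inline wordlist and the minimum word length are placeholders for a real English dictionary and tuned parameters.

```python
# Hypothetical "word coverage" heuristic (NOT the paper's smashword score):
# greedily match dictionary words left-to-right and report the covered fraction.
WORDLIST = {"book", "table", "green", "house", "light", "water", "stone", "paper"}
MIN_LEN = 3  # ignore very short accidental matches

def word_coverage(label, words=WORDLIST):
    covered, i = 0, 0
    while i < len(label):
        match = max((w for w in words
                     if len(w) >= MIN_LEN and label.startswith(w, i)),
                    key=len, default=None)
        if match:
            covered += len(match)
            i += len(match)
        else:
            i += 1
    return covered / len(label) if label else 0.0

for d in ["greenhouse", "waterstonepaper", "xkqzjvprt", "lightbook"]:
    print(f"{d:20s} coverage={word_coverage(d):.2f}")
```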
... There are many additional data sources which are now publicly available, but most are not suitable as-is for training and testing these systems. Some are more specialized datasets, such as the Active DNS Project [35], many contain only malicious traffic, and some data sources are both specialized and malicious, such as only containing peer-to-peer botnet command and control traffic. ...
Preprint
Metrics and frameworks to quantifiably assess security measures have arisen from needs of three distinct research communities - statistical measures from the intrusion detection and prevention literature, evaluation of cyber exercises, e.g., red-team and capture-the-flag competitions, and economic analyses addressing cost-versus-security tradeoffs. In this paper we provide two primary contributions to the security evaluation literature - a representative survey, and a novel framework for evaluating security that is flexible, applicable to all three use cases, and readily interpretable. In our survey of the literature we identify the distinct themes from each community's evaluation procedures side by side and flesh out the drawbacks and benefits of each. Next, we provide a framework for evaluating security by comprehensively modeling the resource, labor, and attack costs in dollars incurred based on quantities, accuracy metrics, and time. This framework is a more "holistic" approach in that it incorporates the accuracy and performance metrics, which dominate intrusion detection evaluation, the time to detection and impact to data and resources of an attack, favored by educational competitions' metrics, and the monetary cost of many essential security components used in financial analysis. Moreover, it is flexible enough to accommodate each use case, easily interpretable, and comprehensive in terms of costs considered. Finally, we provide two examples of the framework applied to real-world use cases. Overall, we provide a survey and a grounded, flexible framework and multiple concrete examples for evaluating security that addresses the needs of three, currently distinct communities.
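As a purely illustrative sketch of the kind of dollar-denominated accounting such a framework argues for (the variable names and cost values below are assumptions, not the paper's model), detector accuracy, analyst labor, and attack impact can be folded into a single cost figure:

```python
# Illustrative cost roll-up (assumed parameters; not the paper's exact framework).
def total_cost(tp, fp, fn,
               triage_cost=50.0,        # $ analyst time to review one alert
               miss_cost=25_000.0,      # $ expected impact of one undetected attack
               hours_to_detect=4.0,     # mean time to detection for true positives
               hourly_damage=500.0,     # $ damage accrued per hour before detection
               infra_cost=10_000.0):    # $ fixed tooling / infrastructure cost
    alert_handling = (tp + fp) * triage_cost
    missed_attacks = fn * miss_cost
    detection_delay = tp * hours_to_detect * hourly_damage
    return infra_cost + alert_handling + missed_attacks + detection_delay

# Example: 40 detected attacks, 400 false alarms, 5 missed attacks.
print(f"${total_cost(tp=40, fp=400, fn=5):,.0f}")
```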
... This is similar to [8], in which a dataset is constructed and features are determined both via typical measures such as IP address, number of transmitted packets, and ports, as well as more intensive methods such as a clustering of NetFlows. In [9], a comparative evaluation of different activation functions is performed on the UNSW-NB15 dataset. Various models are trained with different activation functions and configurations, including Tanh, Tanh with dropout, Rectifier, Rectifier with dropout, Maxout and Maxout with dropout. ...
Article
The Domain Name System (DNS) plays a crucial role in connecting services and users on the Internet. Since its first specification, DNS has been extended in numerous documents to keep it fit for today’s challenges and demands. And these challenges are many. Revelations of snooping on DNS traffic led to changes to guarantee the confidentiality of DNS queries. Attacks to forge DNS traffic led to changes to shore up the integrity of the DNS. Finally, denial-of-service attacks on DNS operations have led to new DNS operations architectures. All of these developments make DNS a highly interesting, but also highly challenging research topic. This tutorial – aimed at graduate students and early-career researchers – provides an overview of the modern DNS, its ongoing development and its open challenges. This tutorial has four major contributions. We first provide a comprehensive overview of the DNS protocol. Then, we explain how DNS is deployed in practice. This lays the foundation for the third contribution: a review of the biggest challenges the modern DNS faces today and how they can be addressed. These challenges are (i) protecting the confidentiality and (ii) guaranteeing the integrity of the information provided in the DNS, (iii) ensuring the availability of the DNS infrastructure, and (iv) detecting and preventing attacks that make use of the DNS. Last, we discuss which challenges remain open, pointing the reader towards new research areas.
Article
Full-text available
URL redirection has become an important tool for adversaries to cover up their malicious campaigns. In this paper, we conduct the first large-scale measurement study on how adversaries leverage URL redirection to circumvent security checks and distribute malicious content in practice. To this end, we design an iteratively running framework to continuously mine the domains used for malicious redirections. First, we use a bipartite graph-based method to dig out the domains potentially involved in malicious redirections from real-world DNS traffic. Then, we dynamically crawl these suspicious domains and recover the corresponding redirection chains from the crawler’s performance log. Based on the collected redirection chains, we analyze the working mechanism of various malicious redirections, including the modes and methods being abused, and highlight the pervasiveness of node sharing. Notably, we find a new redirection abuse, redirection fluxing, which is used to enhance the concealment of malicious sites by introducing randomness into the redirection. Our case studies reveal the adversary’s preference for abusing JavaScript methods to conduct redirection, even by introducing time delays and fabricating user clicks to simulate normal users.
Article
We study hypergraph visualization via its topological simplification. We explore both vertex simplification and hyperedge simplification of hypergraphs using tools from topological data analysis. In particular, we transform a hypergraph into its graph representations known as the line graph and clique expansion. A topological simplification of such a graph representation induces a simplification of the hypergraph. In simplifying a hypergraph, we allow vertices to be combined if they belong to almost the same set of hyperedges, and hyperedges to be merged if they share almost the same set of vertices. Our proposed approaches are general, mathematically justifiable, and put vertex simplification and hyperedge simplification in a unifying framework.
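The two graph representations mentioned above are easy to construct with networkx. The sketch below is an illustration of those constructions only, not the authors' simplification pipeline; the small hypergraph is given as an assumed mapping from hyperedge names to vertex sets.

```python
# Sketch: clique expansion and line graph of a hypergraph given as {hyperedge: vertices}.
from itertools import combinations
import networkx as nx

hypergraph = {"e1": {"a", "b", "c"}, "e2": {"b", "c", "d"}, "e3": {"d", "e"}}

# Clique expansion: connect every pair of vertices that co-occur in some hyperedge.
clique = nx.Graph()
for members in hypergraph.values():
    clique.add_edges_from(combinations(sorted(members), 2))

# Line graph: hyperedges become nodes, connected when they share at least one vertex.
line = nx.Graph()
line.add_nodes_from(hypergraph)
for (e1, v1), (e2, v2) in combinations(hypergraph.items(), 2):
    if v1 & v2:
        line.add_edge(e1, e2)

print("clique expansion edges:", sorted(clique.edges()))
print("line graph edges:", sorted(line.edges()))
```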
Chapter
Most desktop applications use the network, and insecure communications can have a significant impact on the application, the system, the user, and the enterprise. Understanding at scale whether desktop applications use the network securely is a challenge because the application provenance of a given network packet is rarely available at centralized collection points. In this paper, we collect flow data from 39,758 MacOS devices on an enterprise network to study the network behaviors of individual applications. We collect flows locally on-device and can definitively identify the application responsible for every flow. We also develop techniques to distinguish “endogenous” flows common to most executions of a program from “exogenous” flows likely caused by unique inputs. We find that popular MacOS applications are in fact using the network securely, with 95.62% of the applications we study using HTTPS. Notably, we observe that security-sensitive services (including certificate management and mobile device management) do not use ports associated with secure communications. Our study provides important insights for users, device and network administrators, and researchers interested in secure communication.
Article
Malicious websites often mimic top brands to host malware and launch social engineering attacks, e.g., to collect user credentials. Some such sites often attempt to hide malicious content from search engine crawlers (e.g., Googlebot), but show harmful content to users/client browsers—a technique known as cloaking. Past studies uncovered various aspects of cloaking, using selected categories of websites (e.g., mimicking specific types of malicious sites). We focus on understanding cloaking behaviors using a broader set of websites. As a way forward, we built a crawler to automatically browse and analyze content from 100000 squatting (mostly) malicious domains—domains that are generated through typo-squatting and combo-squatting of 2883 popular websites. We use a headless Chrome browser and a search-engine crawler with user-agent modifications to identify cloaking behaviors—a challenging task due to dynamic content, served at random; e.g., consecutive requests serve very different malicious or benign content. Most malicious sites (e.g., phishing and malware) go undetected by current blacklists; only a fraction of cloaked sites (127, 3.3%) are flagged as malicious by VirusTotal. In contrast, we identify 80% of cloaked sites as malicious, via a semi-automated process implemented by extending the content categorization functionality of Symantec’s SiteReview tool. Even after 3 months of observation, nearly half (1024, 45.4%) of the cloaked sites remained active, and only a few (31, 3%) of them were flagged by VirusTotal. This clearly indicates that existing blacklists are ineffective against cloaked malicious sites. Our techniques can serve as a starting point for more effective and scalable early detection of cloaked malicious sites.
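A drastically simplified version of the measurement idea, fetching the same URL with a regular-browser User-Agent and a crawler User-Agent and comparing the responses, is sketched below. It uses plain HTTP requests rather than the headless Chrome setup described above, so it would miss JavaScript-based cloaking; the User-Agent strings, target URL, and similarity threshold are assumptions.

```python
# Sketch: fetch a URL as a "browser" and as a "crawler" and compare the bodies.
# Plain HTTP only (no JavaScript execution), so this underestimates real cloaking.
from difflib import SequenceMatcher
import requests  # pip install requests

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def fetch(url, user_agent):
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return resp.text

def cloaking_score(url):
    """Returns dissimilarity in [0, 1]; higher suggests different content per UA."""
    browser_body = fetch(url, BROWSER_UA)
    crawler_body = fetch(url, CRAWLER_UA)
    return 1.0 - SequenceMatcher(None, browser_body, crawler_body).ratio()

if __name__ == "__main__":
    url = "http://example.com/"   # placeholder target
    score = cloaking_score(url)
    print(f"{url} dissimilarity={score:.2f}",
          "-> possible cloaking" if score > 0.5 else "")
```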
Article
The Domain Name System (DNS) is one of the most important components of today’s Internet, and is the standard naming convention between human-readable domain names and machine-routable Internet Protocol (IP) addresses of Internet resources. However, due to the vulnerability of DNS to various threats, its security and functionality have been continuously challenged over the course of time. Although researchers have addressed various aspects of DNS in the literature, there are still many challenges yet to be addressed. In order to comprehensively understand the root causes of the vulnerabilities of DNS, it is necessary to review the various activities of the research community on the DNS landscape. To this end, this paper surveys more than 170 peer-reviewed papers, published in both top conferences and journals in the last ten years, and summarizes vulnerabilities in DNS and the corresponding countermeasures. This paper not only focuses on the DNS threat landscape and existing challenges, but also discusses the data analysis methods that are frequently used to address DNS threat vulnerabilities. Furthermore, we look into the DNS threat landscape from the viewpoint of the entities involved in the DNS infrastructure, in an attempt to point out the more vulnerable entities in the system.
Article
This survey focuses on intrusion detection systems (IDS) that leverage host-based data sources for detecting attacks on enterprise networks. The host-based IDS (HIDS) literature is organized by input data source, presenting targeted sub-surveys of HIDS research leveraging system logs, audit data, the Windows Registry, file systems, and program analysis. While system calls are generally included in audit data, several publicly available system call datasets have spawned a flurry of IDS research on this topic, which merits a separate section. To accommodate current researchers, a section giving descriptions of publicly available datasets is included, outlining their characteristics and shortcomings when used for IDS evaluation. Related surveys are organized and described. All sections are accompanied by tables concisely organizing the literature and datasets discussed. Finally, challenges, trends, and broader observations are discussed throughout the survey and in the conclusion, along with future directions of IDS research. Overall, this survey was designed to allow easy access to the diverse types of data available on a host for sensing intrusion, the progression of research using each, and the accessible datasets for prototyping in the area.
Chapter
The Domain Name System is a critical piece of infrastructure that has expanded into use cases beyond its original intent. DNS TXT records are intentionally very permissive in what information can be stored there, and as a result are often used in broad and undocumented ways to support Internet security and networked applications. In this paper, we identified and categorized the patterns in TXT record use from a representative collection of resource record sets. We obtained the records from a data set containing 1.4 billion TXT records collected over a 2 year period and used pattern matching to identify record use cases present across multiple domains. We found that 92% of these records generally fall into 3 categories: protocol enhancement, domain verification, and resource location. While some of these records are required to remain public, we discovered many examples that unnecessarily reveal domain information or present other security threats (e.g., amplification attacks) in conflict with best practices in security.
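A toy version of the pattern-matching step might look like the following; the regular expressions are illustrative examples of common TXT conventions (SPF, DKIM, DMARC, verification tokens), not the paper's actual taxonomy or rule set.

```python
# Sketch: bucket TXT record strings into coarse categories with regex patterns.
# The patterns are illustrative, not the categorization rules used in the paper.
import re
from collections import Counter

PATTERNS = [
    ("protocol_enhancement", re.compile(r"^v=spf1\b", re.I)),
    ("protocol_enhancement", re.compile(r"^v=DKIM1\b", re.I)),
    ("protocol_enhancement", re.compile(r"^v=DMARC1\b", re.I)),
    ("domain_verification",  re.compile(r"-(site|domain)-verification=", re.I)),
    ("domain_verification",  re.compile(r"^MS=ms\d+", re.I)),
    ("resource_location",    re.compile(r"https?://", re.I)),
]

def categorize(txt):
    for label, pattern in PATTERNS:
        if pattern.search(txt):
            return label
    return "other"

records = [
    "v=spf1 include:_spf.example.com ~all",
    "google-site-verification=abc123",
    "v=DMARC1; p=reject; rua=mailto:dmarc@example.com",
    "some-opaque-token-0042",
]
print(Counter(categorize(r) for r in records))
```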
Conference Paper
Full-text available
Fast Internet-wide scanning has opened new avenues for security research, ranging from uncovering widespread vulnerabilities in random number generators to tracking the evolving impact of Heartbleed. However, this technique still requires significant effort: even simple questions, such as, "What models of embedded devices prefer CBC ciphers?", require developing an application scanner, manually identifying and tagging devices, negotiating with network administrators, and responding to abuse complaints. In this paper, we introduce Censys, a public search engine and data processing facility backed by data collected from ongoing Internet-wide scans. Designed to help researchers answer security-related questions, Censys supports full-text searches on protocol banners and querying a wide range of derived fields (e.g., 443.https.cipher). It can identify specific vulnerable devices and networks and generate statistical reports on broad usage patterns and trends. Censys returns these results in sub-second time, dramatically reducing the effort of understanding the hosts that comprise the Internet. We present the search engine architecture and experimentally evaluate its performance. We also explore Censys's applications and show how questions asked in recent studies become simple to answer.
Conference Paper
Full-text available
Many botnet detection systems employ a blacklist of known command and control (C&C) domains to detect bots and block their traffic. Similar to signature-based virus detection, such a botnet detection approach is static because the blacklist is updated only after running an external (and often manual) process of domain discovery. As a response, botmasters have begun employing domain generation algorithms (DGAs) to dynamically produce a large number of random domain names and select a small subset for actual C&C use. That is, a C&C domain is randomly generated and used for a very short period of time, thus rendering detection approaches that rely on static domain lists ineffective. Naturally, if we know how a domain generation algorithm works, we can generate the domains ahead of time and still identify and block botnet C&C traffic. The existing solutions are largely based on reverse engineering of the bot malware executables, which is not always feasible. In this paper we present a new technique to detect randomly generated domains without reversing. Our insight is that most of the DGA-generated (random) domains that a bot queries would result in Non-Existent Domain (NXDomain) responses, and that bots from the same botnet (with the same DGA algorithm) would generate similar NXDomain traffic. Our approach uses a combination of clustering and classification algorithms. The clustering algorithm clusters domains based on the similarity in the make-ups of domain names as well as the groups of machines that queried these domains. The classification algorithm is used to assign the generated clusters to models of known DGAs. If a cluster cannot be assigned to a known model, then a new model is produced, indicating a new DGA variant or family. We implemented a prototype system and evaluated it on real-world DNS traffic obtained from large ISPs in North America. We report the discovery of twelve DGAs. Half of them are variants of known (botnet) DGAs, and the other half are brand new DGAs that have never been reported before.
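The first of the two clustering dimensions, similarity in the make-up of the queried NXDomain names, can be approximated with a character n-gram representation. The sketch below is a simplified stand-in for the paper's system: it uses TF-IDF over character 2- and 3-grams with DBSCAN, omits the second dimension (the groups of querying machines), and its parameter values are assumptions.

```python
# Sketch: cluster NXDomain names by character n-gram similarity.
# A simplified stand-in for the paper's combined clustering; parameters are assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

nxdomains = [
    "kq3vxz1a.example", "pq8vtz2b.example", "zq1vxz9c.example",   # DGA-like family A
    "loginupdate-secure.info", "login-update-secure.biz",         # wordlist-style family B
    "myprinter.local",                                            # likely benign noise
]

# Character 2- and 3-grams capture the "make-up" of each queried name.
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X = vec.fit_transform(nxdomains)

labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(X)
for name, label in zip(nxdomains, labels):
    print(f"cluster {label:2d}  {name}")
```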
Conference Paper
Full-text available
In recent years Internet miscreants have been leveraging the DNS to build malicious network infrastructures for malware command and control. In this paper we propose a novel detection system called Kopis for detecting malware-related domain names. Kopis passively monitors DNS traffic at the upper levels of the DNS hierarchy, and is able to accurately detect malware domains by analyzing global DNS query resolution patterns. Compared to previous DNS reputation systems such as Notos [3] and Exposure [4], which rely on monitoring traffic from local recursive DNS servers, Kopis offers a new vantage point and introduces new traffic features specifically chosen to leverage the global visibility obtained by monitoring network traffic at the upper DNS hierarchy. Unlike previous work Kopis enables DNS operators to independently (i.e., without the need of data from other networks) detect malware domains within their authority, so that action can be taken to stop the abuse. Moreover, unlike previous work, Kopis can detect malware domains even when no IP reputation information is available. We developed a proof-of-concept version of Kopis, and experimented with eight months of real-world data. Our experimental results show that Kopis can achieve high detection rates (e.g., 98.4%) and low false positive rates (e.g., 0.3% or 0.5%). In addition Kopis is able to detect new malware domains days or even weeks before they appear in public blacklists and security forums, and allowed us to discover the rise of a previously unknown DDoS botnet based in China.
Conference Paper
Full-text available
We present the first empirical study of fast-flux service networks (FFSNs), a newly emerging and still not widely known phenomenon in the Internet. FFSNs employ DNS to establish a proxy network on compromised machines through which illegal online services can be hosted with very high availability. Through our measurements we show that the threat which FFSNs pose is significant: FFSNs occur on a worldwide scale and already host a substantial percentage of online scams. Based on analysis of the principles of FFSNs, we develop a metric with which FFSNs can be effectively detected. Considering our detection technique we also discuss possible mitigation strategies.
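The paper derives a principled detection metric; the sketch below is only a crude heuristic in the same spirit. It resolves a domain several times over a short interval and looks at how many distinct IP addresses and distinct /16 prefixes (used here as a rough stand-in for network diversity, since ASN lookups are omitted) are returned. The thresholds and timing are assumptions.

```python
# Crude fast-flux heuristic (not the paper's metric): repeated resolutions,
# counting distinct IPs and distinct /16 prefixes as a rough diversity measure.
import time
import dns.resolver  # pip install dnspython

def flux_indicators(domain, rounds=3, pause=2.0):
    ips = set()
    for _ in range(rounds):
        try:
            ips.update(rr.to_text() for rr in dns.resolver.resolve(domain, "A"))
        except Exception:
            pass
        time.sleep(pause)
    prefixes = {".".join(ip.split(".")[:2]) for ip in ips}
    return len(ips), len(prefixes)

def looks_fast_flux(domain, ip_threshold=10, prefix_threshold=3):
    n_ips, n_prefixes = flux_indicators(domain)
    return n_ips >= ip_threshold or n_prefixes >= prefix_threshold

print(looks_fast_flux("example.com"))
```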
Conference Paper
Full-text available
Researchers have recently noted (14; 27) the potential of fast poisoning attacks against DNS servers, which allows attackers to easily manipulate records in open recursive DNS resolvers. A vendor-wide upgrade mitigated but did not eliminate this attack. Further, existing DNS protection systems, including bailiwick-checking (12) and IDS-style filtration, do not stop this type of DNS poisoning. We therefore propose Anax, a DNS protection system that detects poisoned records in cache. Our system can observe changes in cached DNS records, and applies machine learning to classify these updates as malicious or benign. We describe our classification features and machine learning model selection process while noting that the proposed approach is easily integrated into existing local network protection systems. To evaluate Anax, we studied cache changes in a geographically diverse set of 300,000 open recursive DNS servers (ORDNSs) over an eight month period. Using hand-verified data as ground truth, evaluation of Anax showed a very low false positive rate (0.6% of all new resource records) and a high detection rate (91.9%).
Conference Paper
In this paper we study the structure of criminal networks, groups of related malicious infrastructures that work in concert to provide hosting for criminal activities. We develop a method to construct a graph of relationships between malicious hosts and identify the underlying criminal networks, using historic assignments in the DNS. We also develop methods to analyze these networks to identify general structural trends and devise strategies for effective remediation through takedowns. We then apply these graph construction and analysis algorithms to study the general threat landscape, as well as four cases of sophisticated criminal networks. Our results indicate that in many cases, criminal networks can be taken down by de-registering as few as five domain names, removing critical communication links. In cases of sophisticated criminal networks, we show that our analysis techniques can identify hosts that are critical to the network’s functionality and estimate the impact of performing network takedowns in remediating the threats. In one case, disabling 20% of a criminal network’s hosts would reduce the overall volume of successful DNS lookups to the criminal network by as much as 70%. This measure can be interpreted as an estimate of the decrease in the number of potential victims reaching the criminal network that would be caused by such a takedown strategy.
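The takedown analysis can be illustrated with a small networkx sketch: build a graph between domains and the IP addresses they have historically resolved to, then estimate the impact of removing a few high-degree domains by looking at how the graph fragments. This is an illustration of the general idea only, not the authors' graph construction or impact metric, and the resolution data is a toy placeholder.

```python
# Sketch: domain-IP infrastructure graph and a naive takedown impact estimate.
import networkx as nx

# Historic (domain, IP) resolutions; toy data standing in for DNS history.
resolutions = [
    ("bad-a.example", "198.51.100.10"), ("bad-b.example", "198.51.100.10"),
    ("bad-b.example", "198.51.100.11"), ("bad-c.example", "198.51.100.11"),
    ("bad-c.example", "203.0.113.5"),   ("bad-d.example", "203.0.113.5"),
]

G = nx.Graph()
for domain, ip in resolutions:
    G.add_node(domain, kind="domain")
    G.add_node(ip, kind="ip")
    G.add_edge(domain, ip)

def impact_of_takedown(graph, k=1):
    """Remove the k highest-degree domain nodes; report the largest remaining component."""
    domains = [n for n, d in graph.nodes(data=True) if d["kind"] == "domain"]
    targets = sorted(domains, key=graph.degree, reverse=True)[:k]
    pruned = graph.copy()
    pruned.remove_nodes_from(targets)
    largest = max((len(c) for c in nx.connected_components(pruned)), default=0)
    return targets, largest

before = max(len(c) for c in nx.connected_components(G))
targets, after = impact_of_takedown(G, k=1)
print(f"largest component: {before} -> {after} after removing {targets}")
```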
Conference Paper
In this paper, we present an analysis of a new class of domain names: disposable domains. We observe that popular web applications, along with other Internet services, systematically use this new class of domain names. Disposable domains are likely generated automatically, characterized by a 'one-time use' pattern, and appear to be used as a way of 'signaling' via DNS queries. To shed light on the pervasiveness of disposable domains, we study 24 days of live DNS traffic spanning a year observed at a large Internet Service Provider. We find that disposable domains increased from 23.1% to 27.6% of all queried domains, and from 27.6% to 37.2% of all resolved domains observed daily. While this creative use of DNS may enable new applications, it may also have unanticipated negative consequences on the DNS caching infrastructure, DNSSEC validating resolvers, and passive DNS data collection systems.
Article
Overview. The domain name system (abbreviated 'DNS') provides a distributed database that maps domain names to record sets (for example, IP addresses). DNS is one of the core protocol suites of the Internet. Yet DNS data is often volatile, and there are many unwanted records present in the domain name system. This paper presents a technology, called passive DNS replication, to obtain domain name system data from production networks and store it in a database for later reference. The present paper is structured as follows:
• Section 1 briefly recalls a few DNS-related terms used throughout this paper.
• Section 2 motivates the need for passive DNS replication: DNS itself does not allow certain queries whose results are interesting in various contexts (mostly related to response to security incidents).
• Section 3 describes the architecture of the dnslogger software, an implementation of passive DNS replication.
• Section 4 documents successful applications of the technology.
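The storage side of passive DNS replication can be sketched with a small SQLite schema that keeps one row per unique (name, type, data) tuple with first-seen and last-seen timestamps and an observation counter. This is an illustration of the general approach rather than the dnslogger implementation; the table layout and file name are assumptions.

```python
# Sketch of a passive-DNS store: one row per unique (rrname, rrtype, rdata) tuple
# with first_seen / last_seen timestamps and an observation counter.
import sqlite3
import time

db = sqlite3.connect("pdns.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS pdns (
                  rrname TEXT, rrtype TEXT, rdata TEXT,
                  first_seen INTEGER, last_seen INTEGER, count INTEGER,
                  PRIMARY KEY (rrname, rrtype, rdata))""")

def observe(rrname, rrtype, rdata, ts=None):
    ts = int(ts if ts is not None else time.time())
    db.execute("""INSERT INTO pdns VALUES (?, ?, ?, ?, ?, 1)
                  ON CONFLICT(rrname, rrtype, rdata)
                  DO UPDATE SET last_seen = excluded.last_seen, count = count + 1""",
               (rrname, rrtype, rdata, ts, ts))
    db.commit()

# Feed it observations captured above the recursive resolver (toy examples here).
observe("example.com", "A", "93.184.216.34")
observe("example.com", "A", "93.184.216.34")
print(db.execute("SELECT * FROM pdns").fetchall())
```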
Article
Botnet threats, such as server attacks or the sending of spam email, have been increasing. A method of using a blacklist of domain names has been proposed to find infected hosts. However, not all infected hosts may be found by this method because a blacklist does not cover all black domain names. In this paper, we present a method for finding unknown black domain names and extending the blacklist by using DNS traffic data and the original blacklist of known black domain names. We use the co-occurrence relation between two different domain names to find unknown black domain names and extend the blacklist. If a domain name frequently co-occurs with a known black name, we assume that the domain name is also black. We evaluate the proposed method by cross-validation; about 91% of the domain names in the validation list are found within the top 1% of ranked candidates.
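The co-occurrence idea can be sketched as follows. This is a simplified illustration with an assumed window size and scoring, not the authors' exact method: domains that repeatedly appear close in time to known black domains, across many clients, accumulate a score, and the top-scoring ones become candidates for the extended blacklist.

```python
# Sketch: score domains by how often they co-occur (within a time window) with
# known black domains across client query logs; top scores become candidates.
from collections import Counter

WINDOW = 60  # seconds; assumed co-occurrence window

def cooccurrence_scores(logs, blacklist, window=WINDOW):
    """logs: {client_id: [(timestamp, qname), ...]} with timestamps sorted."""
    scores = Counter()
    for queries in logs.values():
        black_times = [t for t, q in queries if q in blacklist]
        for t, q in queries:
            if q in blacklist:
                continue
            if any(abs(t - bt) <= window for bt in black_times):
                scores[q] += 1
    return scores

logs = {
    "client1": [(0, "evil.example"), (10, "helper.example"), (500, "news.example")],
    "client2": [(100, "helper.example"), (120, "evil.example")],
}
print(cooccurrence_scores(logs, blacklist={"evil.example"}).most_common(3))
```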
Conference Paper
In this paper we explore the potential of leveraging properties inherent to domain registrations and their appearance in DNS zone files to predict the malicious use of domains proactively, using only minimal observation of known-bad domains to drive our inference. Our analysis demonstrates that our inference procedure derives on average 3.5 to 15 new domains from a given known-bad domain. 93% of these inferred domains subsequently appear suspect (based on third-party assessments), and nearly 73% eventually appear on blacklists themselves. For the latter, proactive blocking based on our predictions provides a median head start of about 2 days versus using a reactive blacklist, though this gain varies widely for different domains.
Conference Paper
Phishing has been an easy and effective way for trickery and deception on the Internet. While solutions such as URL blacklisting have been effective to some degree, their reliance on exact match with the blacklisted entries makes it easy for attackers to evade. We start with the observation that attackers often employ simple modifications (e.g., changing the top-level domain) to URLs. Our system, PhishNet, exploits this observation using two components. In the first component, we propose five heuristics to enumerate simple combinations of known phishing sites to discover new phishing URLs. The second component consists of an approximate matching algorithm that dissects a URL into multiple components that are matched individually against entries in the blacklist. In our evaluation with real-time blacklist feeds, we discovered around 18,000 new phishing URLs from a set of 6,000 new blacklist entries. We also show that our approximate matching algorithm leads to very few false positives (3%) and negatives (5%).
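Two of the ideas above, generating simple variants of known phishing URLs and matching candidate URLs against the blacklist component by component, are sketched below with assumed variant rules and weights; this is an illustration, not PhishNet's actual heuristics or scoring.

```python
# Sketch of two PhishNet-style ideas (illustrative rules/weights, not the real system):
# (1) enumerate simple variants of known phishing URLs, (2) approximate component matching.
from urllib.parse import urlsplit

TLDS = ["com", "net", "org", "info", "biz"]   # assumed TLD swap list

def tld_variants(url):
    """Heuristic 1: swap the top-level domain of a known phishing URL."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")
    for tld in TLDS:
        if tld != host_labels[-1]:
            new_host = ".".join(host_labels[:-1] + [tld])
            yield parts._replace(netloc=new_host).geturl()

def component_score(url, blacklisted_url):
    """Heuristic 2: approximate match on hostname and path components (assumed weights)."""
    a, b = urlsplit(url), urlsplit(blacklisted_url)
    host_a, host_b = set(a.hostname.split(".")), set(b.hostname.split("."))
    path_a = set(p for p in a.path.split("/") if p)
    path_b = set(p for p in b.path.split("/") if p)
    host_sim = len(host_a & host_b) / max(len(host_a | host_b), 1)
    path_sim = len(path_a & path_b) / max(len(path_a | path_b), 1)
    return 0.6 * host_sim + 0.4 * path_sim

known_bad = "http://secure-login.paypa1-verify.com/update/account"
print(list(tld_variants(known_bad))[:3])
print(component_score("http://secure-login.paypa1-verify.net/update/account", known_bad))
```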
Conference Paper
Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
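A minimal version of the lexical side of such a classifier is sketched below: a handful of hand-picked features and a logistic regression on toy data. The systems described above learn from tens of thousands of automatically extracted lexical and host-based features, so this is only an illustration of the feature-extraction-plus-classifier pattern.

```python
# Sketch: a few hand-picked lexical URL features and a logistic regression.
# Toy data only; the systems described above use far richer feature sets.
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

def lexical_features(url):
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return [
        len(url),                                   # overall length
        url.count("."),                             # number of dots
        url.count("-"),                             # number of hyphens
        sum(c.isdigit() for c in url),              # digit count
        int(bool(re.match(r"^\d+\.\d+\.\d+\.\d+$", host))),  # raw-IP hostname
        int("@" in url),                            # embedded credentials trick
    ]

urls = ["http://paypal.com.secure-update1234.info/login.php?acct=1",
        "http://192.0.2.7/~drop/zeus.exe",
        "https://www.wikipedia.org/wiki/DNS",
        "https://github.com/example/repo"]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign (toy labels)

X = np.array([lexical_features(u) for u in urls])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(np.array([lexical_features("http://203.0.113.9/login-update.php")])))
```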
Conference Paper
We collected DNS responses at the University of Auckland Internet gateway in an SQL database, and analyzed them to detect unusual behaviour. Our DNS response data have included typo squatter domains, fast flux domains and domains being (ab)used by spammers. We observe that current attempts to reduce spam have greatly increased the number of A records being resolved. We also observe that the data locality of DNS requests diminishes because of domains advertised in spam.
Conference Paper
An increasingly popular technique for decreasing user-perceived latency while browsing the Web is to optimistically pre-resolve (or prefetch) domain name resolutions. In this paper, we present a large-scale evaluation of this practice using data collected over the span of several months, and show that it leads to noticeable increases in load on name servers, with questionable caching benefits. Furthermore, to assess the impact that prefetching can have on the deployment of security extensions to DNS (DNSSEC), we use a custom-built cache simulator to perform trace-based simulations using millions of DNS requests and responses collected campus-wide. We also show that the adoption of domain name prefetching raises privacy issues. Specifically, we examine how prefetching amplifies information disclosure attacks to the point where it is possible to infer the context of searches issued by clients.
Conference Paper
The Domain Name System (DNS) is one of the most widely used services in the Internet. In this paper, we consider the question of how DNS traffic monitoring can provide an important and useful perspective on network traffic in an enterprise. We approach this problem by considering three classes of DNS traffic: canonical (i.e., RFC-intended behaviors), overloaded (e.g., blacklist services), and unwanted (i.e., queries that will never succeed). We describe a context-aware clustering methodology that is applied to DNS query-responses to generate the desired aggregates. Our method enables the analysis to be scaled to expose the desired level of detail of each traffic type, and to expose their time-varying characteristics. We implement our method in a tool we call TreeTop, which can be used to analyze and visualize DNS traffic in real-time. We demonstrate the capabilities of our methodology and the utility of TreeTop using a set of DNS traces that we collected from our campus network over a period of three months. Our evaluation highlights both the coarse and fine levels of detail that can be revealed by our method. Finally, we show preliminary results on how DNS analysis can be coupled with general network traffic monitoring to provide a useful perspective for network management and operations.
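A crude stand-in for the three-way split described above (not TreeTop's context-aware clustering) can be written as a rule over query-response pairs: failed lookups are "unwanted", queries under well-known overloaded suffixes such as DNS blacklists are "overloaded", and everything else is treated as "canonical". The suffix list and the toy trace are assumptions.

```python
# Crude three-way labeling of DNS query/response pairs (not TreeTop itself).
from collections import Counter

OVERLOADED_SUFFIXES = (".zen.spamhaus.org", ".dnsbl.sorbs.net", ".dbl.spamhaus.org")

def label(qname, rcode):
    if rcode != "NOERROR":                       # lookups that will never succeed
        return "unwanted"
    if qname.lower().endswith(OVERLOADED_SUFFIXES):
        return "overloaded"                      # e.g., DNS blacklist lookups
    return "canonical"

trace = [("www.example.com", "NOERROR"),
         ("7.113.0.203.zen.spamhaus.org", "NOERROR"),
         ("wpad.corp.local", "NXDOMAIN"),
         ("kq3vxz1a.example", "NXDOMAIN")]
print(Counter(label(q, r) for q, r in trace))
```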
Special Use IPv4 Addresses. RFC 5735 (Best Current Practice), obsoleted by RFC 6890, updated by RFC 6598
  • M. Cotton
  • L. Vegoda
Snake in the grass: Python-based malware used for targeted attacks
  • B. Coat
IANA-Reserved IPv4 Prefix for Shared Address Space
  • J. Weil
  • V. Kuarsingh
  • C. Donley
  • C. Liljenstolpe
  • M. Azinger
Extending black domain name list by using co-occurrence relation between DNS queries
  • K. Ishibashi
  • T. Toyono
  • H. Hasegawa
  • H. Yoshino
Address Allocation for Private Internets. RFC 1918 (Best Current Practice), updated by RFC 6761
  • Y. Rekhter
  • B. Moskowitz
  • D. Karrenberg
  • G. J. de Groot
  • E. Lear
  • L. Daigle