Guanhua Yan

Los Alamos National Laboratory, Los Alamos, New Mexico, United States

Publications (59) · 11.46 Total Impact

  • Deguang Kong, Guanhua Yan
    ABSTRACT: The voluminous malware variants that appear in the Internet have posed severe threats to its security. In this work, we explore techniques that can automatically classify malware variants into their corresponding families. We present a generic framework that extracts structural information from malware programs as attributed function call graphs, in which rich malware features are encoded as attributes at the function level. Our framework further learns discriminant malware distance metrics that evaluate the similarity between the attributed function call graphs of two malware programs. To combine various types of malware attributes, our method adaptively learns the confidence level associated with the classification capability of each attribute type and then adopts an ensemble of classifiers for automated malware classification. We evaluate our approach with a number of Windows-based malware instances belonging to 11 families, and experimental results show that our automated malware classification method is able to achieve high classification accuracy.
    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
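
A minimal sketch of the general idea described above, not the paper's implementation: each malware sample is represented as a function call graph whose nodes carry per-function attribute vectors, and two samples are compared under a weighted distance. The attribute values, the greedy node matching, and the fixed weight vector (which stands in for the learned discriminant metric) are all assumptions made for illustration.

```python
# Illustrative sketch: attributed function call graphs and a simple
# weighted graph distance. Attributes, matching, and weights are assumptions.
import networkx as nx
import numpy as np

def build_attributed_call_graph(functions, calls):
    """functions: {name: attribute vector}, calls: [(caller, callee), ...]"""
    g = nx.DiGraph()
    for name, attrs in functions.items():
        g.add_node(name, x=np.asarray(attrs, dtype=float))
    g.add_edges_from(calls)
    return g

def graph_distance(g1, g2, w):
    """Greedily pair each function in g1 with its closest unused function in
    g2 under a diagonal metric w (a stand-in for a learned metric)."""
    total, used = 0.0, set()
    for _, a in g1.nodes(data="x"):
        best, best_n = None, None
        for n, b in g2.nodes(data="x"):
            if n in used:
                continue
            d = float(np.sum(w * (a - b) ** 2))
            if best is None or d < best:
                best, best_n = d, n
        if best_n is not None:
            used.add(best_n)
            total += best
    return total

g1 = build_attributed_call_graph({"main": [3, 1], "crypt": [8, 2]}, [("main", "crypt")])
g2 = build_attributed_call_graph({"entry": [3, 1], "xor": [7, 2]}, [("entry", "xor")])
print(graph_distance(g1, g2, w=np.array([1.0, 0.5])))
```
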
  • Guanhua Yan, Nathan Brown, Deguang Kong
    ABSTRACT: The ever-growing malware threat in the cyber space calls for techniques that are more effective than widely deployed signature-based detection systems and more scalable than manual reverse engineering by forensic experts. To counter large volumes of malware variants, machine learning techniques have been applied recently for automated malware classification. Despite the successes made from these efforts, we still lack a basic understanding of some key issues, such as what features we should use and which classifiers perform well on malware data. Against this backdrop, the goal of this work is to explore discriminatory features for automated malware classification. We conduct a systematic study on the discriminative power of various types of features extracted from malware programs, and experiment with different combinations of feature selection algorithms and classifiers. Our results not only offer insights into what features most distinguish malware families, but also shed light on how to develop scalable techniques for automated malware classification in practice.
    Proceedings of the 10th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment; 07/2013
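
The kind of feature-selection/classifier study described above can be sketched with scikit-learn. The feature matrix, family labels, selector, and classifiers below are placeholders chosen for illustration, not the paper's actual feature sets or experimental setup.

```python
# Sketch: pair a feature selection algorithm with different classifiers and
# compare them via cross-validation. Data and parameters are placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 500))   # placeholder feature counts per sample
y = rng.integers(0, 4, size=200)           # placeholder malware family labels

for clf in (RandomForestClassifier(n_estimators=100), MultinomialNB()):
    pipe = make_pipeline(SelectKBest(chi2, k=50), clf)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(type(clf).__name__, scores.mean())
```
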
  • Nam P. Nguyen, Guanhua Yan, My T. Thai
    ABSTRACT: With their blistering expansion in recent years, popular online social sites such as Twitter, Facebook and Bebo, have become not only one of the most effective channels for viral marketing but also the major news sources for many people nowadays. Alongside these promising features, however, comes the threat of misinformation propagation which can lead to undesirable effects. Due to the sheer number of online social network (OSN) users and the highly clustered structures commonly shared by these kinds of networks, there is a substantial challenge to efficiently contain viral spread of misinformation in large-scale social networks. In this paper, we focus on how to limit viral propagation of misinformation in OSNs. Particularly, we study a set of problems, namely the βTI-Node Protectors problems, which aim to find the smallest set of highly influential nodes from which disseminating good information helps to contain the viral spread of misinformation, initiated from a set of nodes I, within a desired fraction (1-β) of the nodes in the entire network in T time steps. For this family of problems, we analyze and present solutions including their inapproximability results, greedy algorithms that provide better lower bounds on the number of selected nodes, and a community-based method for solving these problems. We further conduct a number of experiments on real-world traces, and the empirical results show that our proposed methods outperform existing alternative approaches in finding those important nodes that can help to contain the spread of misinformation effectively.
    Computer Networks. 07/2013; 57(10):2133–2146.
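
The greedy flavor of node selection used for problems like these can be sketched as follows: repeatedly add the node whose protection yields the largest marginal drop in the expected number of misinformed nodes. The cascade model and the Monte Carlo estimator below are illustrative assumptions, not the paper's exact propagation model or algorithms.

```python
# Illustrative greedy selection of "protector" nodes under a simple
# independent-cascade-style spread model (an assumption for this sketch).
import random
import networkx as nx

def expected_infected(g, seeds, protected, p=0.1, steps=5, trials=200):
    total = 0
    for _ in range(trials):
        infected = set(seeds) - protected
        frontier = set(infected)
        for _ in range(steps):
            nxt = set()
            for u in frontier:
                for v in g.neighbors(u):
                    if v not in infected and v not in protected and random.random() < p:
                        nxt.add(v)
            infected |= nxt
            frontier = nxt
        total += len(infected)
    return total / trials

def greedy_protectors(g, seeds, k):
    protected = set()
    for _ in range(k):
        base = expected_infected(g, seeds, protected)
        best = max((n for n in g if n not in protected and n not in seeds),
                   key=lambda n: base - expected_infected(g, seeds, protected | {n}))
        protected.add(best)
    return protected

g = nx.karate_club_graph()           # tiny stand-in for a social network
print(greedy_protectors(g, seeds={0}, k=3))
```
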
  • Deguang Kong, Guanhua Yan
    ABSTRACT: In this work, we explore techniques that can automatically classify malware variants into their corresponding families. Our framework extracts structural information from malware programs as attributed function call graphs, further learns discriminant malware distance metrics, finally adopts an ensemble of classifiers for automated malware classification. Experimental results show that our method is able to achieve high classification accuracy.
    Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems; 06/2013
  • Guanhua Yan
    ABSTRACT: In order to evade detection by ever-improving defense techniques, modern botnet masters are constantly looking for new communication platforms for delivering C&C (Command and Control) information. Attracting their attention is the emergence of online social networks such as Twitter, as the information dissemination mechanism provided by these networks can naturally be exploited for spreading botnet C&C information, and the enormous amount of normal communication co-existing in these networks makes it a daunting task to tease out botnet C&C messages. Against this backdrop, we explore graph-theoretic techniques that aid effective monitoring of potential botnet activities in large open online social networks. Our work is based on extensive analysis of a Twitter dataset that contains more than 40 million users and 1.4 billion following relationships, from which we mine patterns in the Twitter network structure that can be leveraged to improve the efficiency of botnet monitoring. Our analysis reveals that the static Twitter topology contains a small core subgraph; after removing it, the Twitter network breaks down into small connected components, each of which can be handily monitored for potential botnet activities. Based on this observation, we propose a method called Peri-Watchdog, which computes the core of a large online social network and derives the set of nodes that are likely to pass botnet C&C information in the periphery of the online social network. We analyze the time complexity of Peri-Watchdog under its normal operations. We further apply Peri-Watchdog to the Twitter graph injected with synthetic botnet structures and investigate its effectiveness in detecting potential C&C information from these botnets. To verify whether patterns observed from the static Twitter graph are common to other online social networks, we analyze another online social network dataset, BrightKite, which captures the evolution of the social graphs formed by its users over half a year. We show not only that a similarly small core exists in the BrightKite network, but also that this core remains stable over the course of BrightKite's evolution. We also find that, to accommodate the dynamic growth of BrightKite, the core has to be updated about every 18 days under a constrained monitoring capacity.
    Computer Networks. 02/2013; 57(2):540–555.
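
The core/periphery idea above can be sketched in a few lines: peel off a dense core and examine how the remaining periphery fragments into small components that can be monitored independently. Using k-core decomposition as the notion of "core" and a toy graph are assumptions made for this sketch, not necessarily the paper's exact core computation.

```python
# Sketch: remove a dense core and list the connected components left
# in the periphery. Core definition and graph are stand-ins.
import networkx as nx

def core_and_periphery(g, k=4):
    core = nx.k_core(g, k=k)                  # densely connected core subgraph
    periphery = g.copy()
    periphery.remove_nodes_from(core.nodes)
    comps = sorted(nx.connected_components(periphery), key=len, reverse=True)
    return core, comps

g = nx.karate_club_graph()                    # tiny stand-in for a social graph
core, comps = core_and_periphery(g, k=4)
print(len(core), [len(c) for c in comps])
```
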
  •
    ABSTRACT: As the first step of the communication procedure in 802.11, an unwise selection of the access point (AP) hurts a client's throughput. This performance degradation is usually hard to offset by other means, such as efficient rate adaptation. In this paper, we study the AP selection problem in a decentralized manner, with the objective of maximizing the minimum throughput among all clients. We reveal through theoretical analysis that the selfish strategy, which is commonly applied in decentralized systems, cannot effectively achieve this objective. Accordingly, we propose an online AP association strategy that not only achieves a minimum throughput (among all clients) that is provably close to the optimum, but also works effectively in practice with reasonable computation and transmission overhead. The association protocol applying this strategy is implemented on commercial hardware and is compatible with legacy APs without any modification. We demonstrate its feasibility and performance through real experiments and extensive simulations.
    IEEE Transactions on Parallel and Distributed Systems 01/2013; 24(12):2482-2491. · 1.80 Impact Factor
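
A simplified sketch of online AP association aimed at the max-min objective above: each arriving client joins the AP that maximizes the resulting minimum per-client throughput, assuming equal airtime sharing at every AP. This greedy rule and the rate table are illustrative assumptions, not the provably near-optimal strategy described in the paper.

```python
# Sketch: greedy online association under a max-min throughput objective,
# with equal airtime sharing per AP (a simplifying assumption).
def min_throughput(assignment, rates):
    # rates[c][a] = PHY rate of client c at AP a
    worst = float("inf")
    for a in set(assignment.values()):
        members = [c for c, ap in assignment.items() if ap == a]
        for c in members:
            worst = min(worst, rates[c][a] / len(members))
    return worst

def associate_online(clients, aps, rates):
    assignment = {}
    for c in clients:
        best_ap = max(aps, key=lambda a: min_throughput({**assignment, c: a}, rates))
        assignment[c] = best_ap
    return assignment

rates = {"c1": {"ap1": 54, "ap2": 24},
         "c2": {"ap1": 54, "ap2": 6},
         "c3": {"ap1": 12, "ap2": 24}}
assign = associate_online(["c1", "c2", "c3"], ["ap1", "ap2"], rates)
print(assign, min_throughput(assign, rates))
```
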
  •
    ABSTRACT: Proximity-based mobile social networking (PMSN) refers to the social interaction among physically proximate mobile users. The first step toward effective PMSN is for mobile users to choose whom to interact with. Profile matching refers to two users comparing their personal profiles and is promising for user selection in PMSN. It, however, conflicts with users' growing privacy concerns about disclosing their personal profiles to complete strangers. This paper tackles this open challenge by designing novel fine-grained private matching protocols. Our protocols enable two users to perform profile matching without disclosing any information about their profiles beyond the comparison result. In contrast to existing coarse-grained private matching schemes for PMSN, our protocols allow finer differentiation between PMSN users and can support a wide range of matching metrics at different privacy levels. The performance of our protocols is thoroughly analyzed and evaluated via real smartphone experiments.
    IEEE Journal on Selected Areas in Communications 01/2013; 31(9):656-668. · 3.12 Impact Factor
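
One common building block behind fine-grained private matching is a privacy-preserving dot product over two attribute vectors. The sketch below uses additively homomorphic Paillier encryption via the python-paillier ("phe") package; it is a simplification for illustration under assumed tooling, not the protocol suite evaluated in the paper (which supports richer matching metrics and privacy levels).

```python
# Sketch: privacy-preserving dot product of two profile vectors using
# additively homomorphic encryption (python-paillier). Illustrative only.
from phe import paillier

alice_profile = [5, 0, 3, 1]       # e.g., interest levels on shared attributes
bob_profile   = [4, 2, 3, 0]

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

# Alice sends her encrypted profile; Bob computes the encrypted dot product
# without learning Alice's attribute values.
enc_alice = [pub.encrypt(v) for v in alice_profile]
enc_dot = sum(e * b for e, b in zip(enc_alice, bob_profile))

# Alice decrypts only the final matching score.
print(priv.decrypt(enc_dot))       # 5*4 + 0*2 + 3*3 + 1*0 = 29
```
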
  •
    ABSTRACT: With a long history of compromising Internet security, Distributed Denial-of-Service (DDoS) attacks have been intensively investigated and numerous countermeasures have been proposed to defend against them. In this work, we propose a non-standard game-theoretic framework that facilitates evaluation of DDoS attacks and defense. Our framework can be used to study diverse DDoS attack scenarios where multiple layers of protection are deployed and a number of uncertain factors affect the decision making of the players, and it also allows us to model different sophistication levels of reasoning by both the attacker and the defender. We conduct a variety of experiments to evaluate DDoS attack and defense scenarios where one or more layers of defense mechanisms are deployed, and demonstrate that our framework sheds light on the interplay between the decision making of the attacker and that of the defender, as well as how their decisions affect the outcomes of DDoS attack and defense games.
    Proceedings of the 2012 ACM conference on Computer and communications security; 10/2012
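
A toy illustration of attacker/defender reasoning over a payoff matrix, in the spirit of modeling different sophistication levels of reasoning: a level-0 player picks a default strategy, and each higher level best-responds to the level below. The strategies and payoff numbers are invented for illustration and are not taken from the paper's framework.

```python
# Toy level-k best-response dynamics over an invented DDoS payoff matrix.
import numpy as np

# defender_pay[d, a] and attacker_pay[d, a] for
# defender strategies d in {no defense, rate limiting, rate limiting + puzzles}
# and attacker strategies a in {no attack, small botnet, large botnet}.
defender_pay = np.array([[ 5, -4, -9],
                         [ 3,  1, -3],
                         [ 1,  0,  2]])
attacker_pay = np.array([[ 0,  4,  9],
                         [ 0, -1,  3],
                         [ 0, -2, -4]])

def level_k(k, level0_defender=0, level0_attacker=2):
    d, a = level0_defender, level0_attacker
    for i in range(k):
        if i % 2 == 0:   # defender best-responds to the attacker's strategy
            d = int(np.argmax(defender_pay[:, a]))
        else:            # attacker best-responds to the defender's strategy
            a = int(np.argmax(attacker_pay[d, :]))
    return d, a

for k in range(4):
    print(k, level_k(k))
```
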
  •
    ABSTRACT: With their blistering expansion in recent years, popular online social sites such as Twitter, Facebook and Bebo have become some of the major news sources as well as the most effective channels for viral marketing nowadays. However, alongside these promising features comes the threat of misinformation propagation, which can lead to undesirable effects such as the widespread panic in the general public caused by faulty swine flu tweets on Twitter in 2009. Due to the sheer number of online social network (OSN) users and the highly clustered structures commonly observed in these kinds of networks, it poses a substantial challenge to efficiently contain the viral spread of misinformation in large-scale social networks. In this paper, we focus on how to limit viral propagation of misinformation in OSNs. Particularly, we study a set of problems, namely the βTI-Node Protectors problems, which aim to find the smallest set of highly influential nodes whose decontamination with good information helps to contain the viral spread of misinformation, initiated from the set I, to a desired ratio (1 − β) in T time steps. For this family of problems, we analyze and present solutions including an inapproximability result, greedy algorithms that provide better lower bounds on the number of selected nodes, and a community-based heuristic method for the Node Protector problems. To verify our suggested solutions, we conduct experiments on real-world traces including the NetHEPT, NetHEPT_WC and Facebook networks. Empirical results indicate that our methods are among the best at identifying those important nodes in comparison with other available methods.
    Proceedings of the 3rd Annual ACM Web Science Conference; 06/2012
  •
    ABSTRACT: Proximity-based mobile social networking (PMSN) refers to the social interaction among physically proximate mobile users directly through the Bluetooth/WiFi interfaces on their smartphones or other mobile devices. It has become increasingly popular due to the recent explosive growth in smartphone users. Profile matching means two users comparing their personal profiles and is often the first step towards effective PMSN. It, however, conflicts with users' growing privacy concerns about disclosing their personal profiles to complete strangers before deciding to interact with them. This paper tackles this open challenge by designing a suite of novel fine-grained private matching protocols. Our protocols enable two users to perform profile matching without disclosing any information about their profiles beyond the comparison result. In contrast to existing coarse-grained private matching schemes for PMSN, our protocols allow finer differentiation between PMSN users and can support a wide range of matching metrics at different privacy levels. The security and communication/computation overhead of our protocols are thoroughly analyzed and evaluated via detailed simulations.
    Proceedings - IEEE INFOCOM 01/2012;
  •
    ABSTRACT: Recently, tuning the clear channel assessment (CCA) threshold in conjunction with power control has been considered for improving the performance of WLANs. However, we show that CCA tuning can be exploited by selfish nodes to obtain an unfair share of the available bandwidth. Specifically, a selfish entity can manipulate the CCA threshold to ignore ongoing transmissions; this increases the probability of accessing the medium and provides the entity a higher, unfair share of the bandwidth. We experiment on our 802.11 testbed to characterize the effects of CCA tuning on both isolated links and 802.11 WLAN configurations. We focus on AP-client(s) configurations, proposing a novel approach to detect this misbehavior. A misbehaving client is unlikely to recognize low-power receptions as legitimate packets; by intelligently sending low-power probe messages, an AP can efficiently detect a misbehaving node. Our key contributions are: 1) we are the first to quantify the impact of selfish CCA tuning via extensive experimentation on various 802.11 configurations; 2) we propose a lightweight scheme for detecting selfish nodes that inappropriately increase their CCA thresholds; and 3) we extensively evaluate our system on our testbed; its accuracy is 95 percent while the false positive rate is less than 5 percent.
    IEEE Transactions on Mobile Computing 01/2012; 11:1086-1101. · 2.40 Impact Factor
  •
    ABSTRACT: We study the parameters (knobs) of distribution-based anomaly detection methods, and how their tuning affects the quality of detection. Specifically, we analyze the popular entropy-based anomaly detection in detecting covert channels in Voice over IP (VoIP) traffic. There has been little effort in prior research to rigorously analyze how the knobs of an anomaly detection methodology should be tuned. Such analysis is, however, critical before such methods can be deployed by a practitioner. We develop a probabilistic model to explain the effects of tuning the knobs on the rate of false positives and false negatives. We then study the observations produced by our model analytically as well as empirically. We examine the knobs of window length and detection threshold. Our results show how the knobs should be set for achieving a high rate of detection while maintaining a low rate of false positives. We also show how the throughput of the covert channel (the magnitude of the anomaly) affects the rate of detection, thereby allowing a practitioner to be aware of the capabilities of the methodology.
    01/2012;
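
The two knobs discussed above can be made concrete with a minimal sketch: compute the empirical entropy of a traffic feature over sliding windows of length W and raise an alarm when it deviates from a baseline by more than a threshold T. The synthetic symbol stream and the way the covert channel skews the distribution are placeholders, not the paper's VoIP traces or model.

```python
# Sketch: windowed entropy anomaly detection with two knobs
# (window length W, detection threshold T). Data is synthetic.
import numpy as np
from collections import Counter

def entropy(window):
    counts = np.array(list(Counter(window).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
normal = rng.integers(0, 16, size=5000)   # baseline traffic symbols
covert = rng.integers(0, 4, size=1000)    # covert channel skews the distribution
stream = np.concatenate([normal, covert])

W, T = 500, 0.8                           # the two knobs
baseline = entropy(normal[:W])
for start in range(0, len(stream) - W + 1, W):
    h = entropy(stream[start:start + W])
    if abs(h - baseline) > T:
        print(f"alarm in window starting at {start}: entropy={h:.2f}")
```
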
  •
    ABSTRACT: User association logs play an important role in wireless network research. One concern of sharing such logs with other researchers, however, is that they pose potential privacy risks for the network users. Today, the common practice in sanitizing these logs before releasing them to the public is to anonymize users' sensitive information, such as their devices' MAC addresses and their exact association locations. In this work, we aim to study whether such sanitization measures are sufficient to protect user privacy. By simulating an adversary's role, we propose a novel type of correlation attack, based on Conditional Random Fields (CRFs), in which the adversary uses the anonymized association log to build signatures against each user; when combined with auxiliary information, such signatures can help to identify users within the anonymized log. Using a user association log that contains more than four thousand users and millions of association records, we demonstrate that this attack technique, under certain circumstances, is able to pinpoint the victim's identity exactly with a probability as high as 70%, or narrow it down to a set of 20 candidates with a probability close to 100%. We further evaluate the effectiveness of standard anonymization techniques, including generalization and perturbation, in mitigating correlation attacks; our experimental results reveal only limited success of these methods, suggesting that more thorough treatment is needed when anonymizing wireless user association logs before public release.
    INFOCOM 2011. 30th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 10-15 April 2011, Shanghai, China; 01/2011
  •
    ABSTRACT: Botnets are one of the most serious security threats to the Internet and its end users. In recent years, utilizing P2P as a Command and Control (C&C) protocol has become popular due to its decentralized nature, which can help hide the botmaster's identity. Most bot detection approaches targeting P2P botnets either rely on behavior monitoring or on traffic flow and packet analysis, requiring fine-grained information collected locally. This requirement limits the scale of detection. In this paper, we consider detection of P2P botnets at a higher level, the infrastructure level, by exploiting their structural properties from a graph analysis perspective. Using three different P2P overlay structures, we measure the effectiveness of detecting each structure at various locations (the Autonomous System (AS), the Point of Presence (PoP), and the router rendezvous) in the Internet infrastructure.
    19th International Workshop on Quality of Service, IWQoS 2011, San Jose, California, USA, 6-7 June 2011.; 01/2011
  •
    ABSTRACT: In this paper, we study some geographic aspects of the Internet. We base our analysis on a large set of geolocated IP hop-level session data (including about 300,000 backbone routers, 130 million end hosts, and one billion sessions) that we synthesized from a variety of different input sources such as US census data, computer usage statistics, Internet market share data, IP geolocation data sets, CAIDA's Skitter data set for backbone connectivity, and BGP routing tables. We use this model to perform a nationwide and statewide geographic analysis of the Internet. Our main observations are: (1) There is a dominant coast-to-coast pattern in US Internet traffic; in many instances, even if the end-devices are not near either coast, the traffic between them still takes a long detour through the coasts. (2) More than half of the Internet paths are inflated by 100% or more compared to their corresponding geometric straight-line distance. This circuitousness makes the average ratio between the routing distance and the geometric distance large (around 10). (3) The weighted mean hop count is around 5, but the hop counts are very loosely correlated with the distances. The weighted mean AS count (number of ASes traversed) is around 3. Owing to its great importance, the Internet has been the subject of a large number of studies. Much of the previous work has focused on studying the topology of the Internet at the network level, without any regard to geography. In this paper, we perform a geography-based analysis of the Internet. Our main focus is on understanding the geographic properties of routing and the geographic structure of autonomous systems. Our conclusions provide new insights into the structure and functioning of the Internet. Our results are obtained using a very high fidelity model of the US Internet infrastructure that we create by combining various datasets. Our background topology is derived primarily from CAIDA's Skitter dataset. We use the TeleGeography colocation database to obtain all the major point-of-presence locations in the US. We then simulate millions of end-devices and billions of session-level traffic flows between these end-devices. The end-devices and the session traffic are generated in consultation with US census data, computer usage surveys, and market shares of various Internet service providers. For routing, we use an AS (autonomous system) path inference algorithm that uses realistic BGP tables to derive inter-domain paths. The level of authenticity captured by our model has rarely been achieved before. It is a well-known fact that Internet routes can be highly circuitous (1), (2). In this paper, we ask the question: how geographic is Internet routing? We compute the travel distance between two end-points as the sum of the geometric (geographic) distances of the various links on the path. For example, if the path from an end-device in Los Angeles to one in New York goes through San Francisco and Miami, the travel distance for this path is the sum of the geometric distances from Los Angeles to San Francisco, from San Francisco to Miami, and from Miami to New York. Our experiments show that a large fraction of the traffic travels through the east and/or the west coast of the US. Consider two end-devices A and B and the traffic flowing from A to B, and let s and t be the locations of A and B, respectively. What we observe is that for many such pairs A and B, the packets from A travel (possibly multiple times) to the east and/or the west coast before reaching B, and this is true even if neither A nor B is near either coast. We observe this phenomenon both at the national level (entire US traffic) and the state level (traffic originating from some particular state). Looking at the ratio between travel and geometric distance, we observe that more than 50% of the traffic has this ratio greater than 2 (i.e., the travel distance is at least twice the geometric distance) and about 20% of the traffic has this ratio greater than 4. One observes a similar behavior even if the traffic volume (number of bytes flowing across) is taken into account. For example, about 46% of the traffic volume our model generates is between end-devices that are less than 1000 miles apart, whereas only 13% of the traffic volume has a travel distance of less than 1000 miles. Another related question that we investigate is the spread of the hop and AS counts and their relationship with distance. The majority of the paths have a hop count of less than 6, and we found that the average hop count is near 5. The AS count (the number of ASes passed on the way) is almost always less than 3 and for most of the traffic it is around 2. Also somewhat surprising is the fact that the hop count is very loosely correlated with the geometric distance. For example, it is almost equally likely that two end-devices 500 miles or 2000 miles apart will have a hop count of 5. A similar lack of correlation also holds between the hop count and the travel distance.
    INFOCOM 2011. 30th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 10-15 April 2011, Shanghai, China; 01/2011
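
The circuitousness measurement described above reduces to a simple computation: the travel distance of a routed path (the sum of the great-circle lengths of its hops) divided by the straight-line distance between its endpoints. The sketch below uses the haversine formula and the Los Angeles to New York detour mentioned in the abstract; the coordinates are approximate examples.

```python
# Sketch: travel distance vs. geometric distance for a routed path.
from math import radians, sin, cos, asin, sqrt

def haversine_miles(p, q):
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))   # Earth radius ~3959 miles

# example path: Los Angeles -> San Francisco -> Miami -> New York
path = [(34.05, -118.24), (37.77, -122.42), (25.76, -80.19), (40.71, -74.01)]
travel = sum(haversine_miles(path[i], path[i + 1]) for i in range(len(path) - 1))
geometric = haversine_miles(path[0], path[-1])
print(f"travel={travel:.0f} mi, geometric={geometric:.0f} mi, ratio={travel / geometric:.2f}")
```
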
  •
    ABSTRACT: Online social networks, which have been expanding at a blistering speed recently, have emerged as a popular communication infrastructure for Internet users. Meanwhile, malware that specifically targets these online social networks is also on the rise. In this work, we aim to investigate the characteristics of malware propagation in online social networks. Our study is based on a dataset collected from a real-world location-based online social network, which includes not only the social graph formed by its users but also the users' activity events. We analyze the social structure and user activity patterns of this network, and confirm that it is a typical online social network, suggesting that conclusions drawn from this specific network can be translated to other online social networks. We use extensive trace-driven simulation to study the impact of initial infection, user click probability, social structure, and activity patterns on malware propagation in online social networks. We also investigate the performance of a few user-oriented and server-oriented defense schemes against malware spreading in online social networks and identify key factors that affect their effectiveness. We believe that this comprehensive study has deepened our understanding of the nature of online social network malware and also shed light on how to defend against it effectively.
    Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, ASIACCS 2011, Hong Kong, China, March 22-24, 2011; 01/2011
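
A simplified sketch of the kind of simulation described above: malware spreads over a social graph when an infected user's message reaches a neighbor who is active in that round and clicks the malicious link with some probability. The activity model, graph, and parameters below are stand-ins for the trace-driven inputs used in the paper.

```python
# Sketch: activity- and click-driven malware spread on a social graph.
import random
import networkx as nx

def simulate(g, initial, click_prob=0.2, rounds=20, activity_prob=0.5):
    infected = set(initial)
    for _ in range(rounds):
        newly = set()
        for u in infected:
            for v in g.neighbors(u):
                # v must be active this round and click the malicious link
                if v not in infected and random.random() < activity_prob \
                        and random.random() < click_prob:
                    newly.add(v)
        infected |= newly
    return infected

g = nx.watts_strogatz_graph(2000, 10, 0.1, seed=7)   # stand-in social graph
print(len(simulate(g, initial=[0], click_prob=0.2)))
```
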
  •
    ABSTRACT: Botnets have emerged as one of the most severe cyber-threats in recent years. To evade detection and improve resistance against countermeasures, botnets have evolved from the first generation that relies on IRC chat channels to deliver commands to the current generation that uses highly resilient P2P (peer-to-peer) protocols to spread their C&C (Command and Control) information. On an encouraging note, the seminal work done by Holz et al. [14] showed that P2P botnets, although relieved from the single point of failure that IRC botnets suffer, can be easily disrupted using pollution-based mitigation schemes. For white-hat cyber-security practitioners to be better prepared for potentially destructive P2P botnets, it is necessary for them to understand the strategy space from the attacker's perspective. Against this backdrop, we analyze a new type of P2P botnet, which we call AntBot, that aims to spread its C&C information to individual bots even though an adversary persistently pollutes keys used by seized bots to search for the C&C information. The tree-like structure of AntBot, together with the randomness and redundancy in its design, renders it possible that individual bots, when captured, reveal only limited information. We mathematically analyze the performance of AntBot from the perspectives of reachability, resilience to pollution, and scalability. To evaluate the effectiveness of AntBot against pollution-based mitigation in a practical setting, we develop a distributed high-fidelity P2P botnet simulator that uses the actual implementation code of aMule, a popular Kademlia-based P2P client. The simulator offers us a tool to evaluate the attacker's strategy in the cyber space without causing the ethical or legal issues that may result from real-world deployment. Using extensive simulation, we demonstrate that AntBot operates resiliently against pollution-based mitigation. We further suggest a few potential defense schemes that could effectively disrupt AntBot operations and also present challenges that researchers need to address when developing these techniques in practice.
    Computer Networks. 01/2011; 55:1941-1956.
  • Chrisil Arackaparambil, Guanhua Yan
    ABSTRACT: Wikipedia has become a standard source of reference online, and many people (some unknowingly) now trust this corpus of knowledge as an authority to fulfil their information requirements. In doing so they task the human contributors of Wikipedia with maintaining the accuracy of articles, a job that these contributors have been performing admirably. We study the problem of monitoring the Wikipedia corpus with the goal of automated, online anomaly detection. We present Wiki-watchdog, an efficient distribution-based methodology that monitors distributions of revision activity for changes. We show that using our methods it is possible to detect the activity of bots, flash events, and outages as they occur. Our methods are proposed to support the monitoring work of the contributors. They are useful for speeding up anomaly detection and for identifying events that are hard to detect manually. We show the efficacy and the low false-positive rate of our methods through experiments on the revision history of Wikipedia. Our results show that distribution-based anomaly detection has a higher detection rate than traditional methods based on either volume or entropy alone. Unlike previous work on anomaly detection in information networks that worked with a static network graph, our methods consider the network as it evolves and monitor its properties for changes. Although our methodology is developed and evaluated on Wikipedia, we believe it is an effective generic anomaly detection framework in its own right.
    Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Campus Scientifique de la Doua, Lyon, France, August 22-27, 2011; 01/2011
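
Distribution-based monitoring of the kind described above can be sketched by comparing the per-contributor distribution of revisions in consecutive time windows and alarming when the two distributions diverge sharply. The use of Jensen-Shannon distance, the window construction, and the threshold below are assumptions for illustration, not Wiki-watchdog's exact statistics.

```python
# Sketch: flag a window whose revision-activity distribution diverges
# strongly from the previous window's distribution. Synthetic data.
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon

def distribution(revisions, editors):
    counts = Counter(revisions)
    return np.array([counts.get(e, 0) + 1e-9 for e in editors], dtype=float)

def alarm(window_a, window_b, threshold=0.3):
    editors = sorted(set(window_a) | set(window_b))
    p, q = distribution(window_a, editors), distribution(window_b, editors)
    return jensenshannon(p / p.sum(), q / q.sum()) > threshold

quiet = ["alice", "bob", "carol"] * 30
bot_burst = quiet + ["spambot"] * 200          # a bot floods one window
print(alarm(quiet, quiet), alarm(quiet, bot_burst))
```
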
  • Dong Jin, David M. Nicol, Guanhua Yan
    ABSTRACT: The DNP3 protocol is widely used in SCADA systems (particularly electrical power) as a means of communicating observed sensor state information back to a control center. Typical architectures using DNP3 have a two-level hierarchy, where a specialized data aggregator receives observed state from devices within a local region, and the control center collects the aggregated state from the data aggregator. The DNP3 communications are asynchronous across the two levels; this leads to the possibility of completely filling a data aggregator's buffer of pending events when a compromised relay sends overly many (false) events to the data aggregator. This paper investigates this attack by implementing it using real SCADA system hardware and software. A Discrete-Time Markov Chain (DTMC) model is developed for understanding the conditions under which the attack is successful and effective. The model is validated by a Möbius simulation model and data collected on a real SCADA testbed.
    Proceedings - Winter Simulation Conference 01/2011;
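
A toy discrete-time Markov chain in the spirit of the model described above: the state is the data aggregator's buffer occupancy, each step a (false) event arrives with some probability and a control-center poll drains a batch of entries with another probability, and we estimate the probability that the buffer fills within n steps. All parameters and the specific transition structure are invented for illustration, not the paper's DTMC.

```python
# Toy DTMC transient analysis of buffer-filling under event flooding.
import numpy as np

def fill_probability(capacity=50, p=0.9, q=0.05, d=10, n=200):
    # dist[i] = probability the buffer holds i events; "full" is absorbing
    dist = np.zeros(capacity + 1)
    dist[0] = 1.0
    for _ in range(n):
        nxt = np.zeros_like(dist)
        nxt[capacity] = dist[capacity]                 # absorbing full state
        for i in range(capacity):
            for arrive in (0, 1):                      # false event arrives w.p. p
                for drain in (0, d):                   # poll drains d entries w.p. q
                    pr = (p if arrive else 1 - p) * (q if drain else 1 - q)
                    j = min(max(i + arrive - drain, 0), capacity)
                    nxt[j] += dist[i] * pr
        dist = nxt
    return dist[capacity]

print(fill_probability())
```
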
  •
    ABSTRACT: Botnets have emerged as one of the most severe cyber threats in recent years. To obtain high resilience against a single point of failure, the new generation of botnets has adopted the peer-to-peer (P2P) structure. One critical question regarding these P2P botnets is: how big are they indeed? To address this question, researchers have proposed both actively crawling and passively monitoring methods to enumerate existing P2P botnets. In this work, we go further to explore the potential strategies that botnets may use to obfuscate their true sizes. Towards this end, this paper introduces RatBot, a P2P botnet that applies statistical techniques to defeat existing P2P botnet enumeration methods. The key ideas of RatBot are two-fold: (1) there exists a fraction of bots that are indistinguishable from their fake identities, which are spoofed IP addresses they use to hide themselves; (2) we use a heavy-tailed distribution to generate the number of fake identities for each of these bots, so that the sum of observed fake identities converges only slowly and thus has high variation. We use large-scale high-fidelity simulation to quantify the estimation errors under diverse settings, and the results show that a naive enumeration technique can overestimate the sizes of P2P botnets by an order of magnitude. We believe that our work reveals new challenges in accurately estimating the sizes of P2P botnets, and we hope that it will raise security practitioners' awareness of these challenges. We further suggest a few countermeasures that can potentially defeat RatBot's anti-enumeration scheme.
    Information Security, 14th International Conference, ISC 2011, Xi'an, China, October 26-29, 2011. Proceedings; 01/2011
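
The obfuscation idea above can be illustrated with a short sampling sketch: a fraction of bots each advertise a heavy-tailed number of spoofed identities, so a naive enumerator that counts distinct observed identities badly overestimates the true botnet size. The Pareto distribution and parameter values are arbitrary illustration choices, not RatBot's actual design parameters.

```python
# Sketch: heavy-tailed fake identities inflate a naive size estimate.
import numpy as np

rng = np.random.default_rng(42)
true_bots = 10_000
obfuscating = rng.random(true_bots) < 0.1            # 10% of bots spoof identities
fake_ids = np.where(obfuscating,
                    rng.pareto(a=1.2, size=true_bots) * 10,   # heavy-tailed counts
                    0).astype(int)

naive_estimate = true_bots + fake_ids.sum()          # distinct identities observed
print(f"true size={true_bots}, naive estimate={naive_estimate}, "
      f"overestimation x{naive_estimate / true_bots:.1f}")
```
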

Publication Stats

349 Citations
11.46 Total Impact Points

Institutions

  • 2006–2013
    • Los Alamos National Laboratory
      • Computer, Computational, and Statistical Sciences Division
      Los Alamos, New Mexico, United States
  • 2010
    • George Mason University
      • Department of Computer Science
      Fairfax, VA, United States
  • 2004–2006
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      Urbana, Illinois, United States
  • 2003
    • Dartmouth College
      • Department of Computer Science
      Hanover, New Hampshire, United States