Guanhua Yan

Los Alamos National Laboratory, Los Alamos, New Mexico, United States

Publications (65) · 20.97 Total Impact Points

  • Guanhua Yan · Stephan Eidenbenz
    ABSTRACT: Graphs are widely used to characterize relationships or information flows among entities in large networks or distributed systems. In this work, we propose a systematic framework that leverages temporal similarity inherent in dynamic graphs for anomaly detection. This framework relies on the Neyman-Pearson criterion to choose similarity measures with high discriminative power for online anomaly detection in dynamic graphs. We formulate the problem rigorously, and after establishing its inapproximability result, we develop a greedy algorithm for similarity measure selection. We apply this framework to dynamic graphs generated from email communications among thousands of employees in a large research institution and demonstrate that it works effectively on a set of more than 100 candidate graph similarity measures.
    2014 IEEE 34th International Conference on Distributed Computing Systems (ICDCS); 06/2014
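    A minimal illustrative sketch of the idea described in the abstract above: score consecutive snapshots of a dynamic communication graph with one candidate similarity measure and flag abrupt changes. The measure, threshold, and daily granularity here are assumptions for illustration, not the paper's actual framework or parameters.
    ```python
    # Sketch only: flag days whose communication graph differs sharply from
    # the previous day under one candidate similarity measure.
    import networkx as nx

    def jaccard_edge_similarity(g1, g2):
        """One candidate measure: Jaccard overlap of the two edge sets."""
        e1, e2 = set(g1.edges()), set(g2.edges())
        if not e1 and not e2:
            return 1.0
        return len(e1 & e2) / len(e1 | e2)

    def flag_anomalous_days(daily_graphs, threshold=0.3):
        """Return indices of days whose similarity to the prior day drops below threshold."""
        anomalies = []
        for t in range(1, len(daily_graphs)):
            if jaccard_edge_similarity(daily_graphs[t - 1], daily_graphs[t]) < threshold:
                anomalies.append(t)
        return anomalies
    ```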
  • Deguang Kong · Guanhua Yan
    ABSTRACT: The numerous malware variants existing in cyberspace have posed severe threats to its security. Supervised learning techniques have been applied to automate the process of classifying malware variants. Supervised learning, however, suffers in situations where we have only scarce labeled malware samples. In this work, we propose a transductive malware classification framework, which propagates label information from labeled instances to unlabeled ones. We improve the existing Harmonic function approach based on the maximum confidence principle. We apply this framework to the structural information collected from malware programs, and propose a PageRank-like algorithm to evaluate the distance between two malware programs. We evaluate the performance of our method against the standard Harmonic function method as well as two popular supervised learning techniques. Experimental results suggest that our method outperforms these existing approaches in classifying malware variants when only a small number of labeled samples are available.
    IEEE INFOCOM 2014 - IEEE Conference on Computer Communications; 04/2014
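    For context, a hedged sketch of the standard harmonic-function label propagation that the paper builds on; the similarity-matrix construction, the PageRank-like distance, and the maximum-confidence refinement are not reproduced here, so this is only the textbook baseline.
    ```python
    import numpy as np

    def harmonic_label_propagation(W, labeled_idx, labels):
        """W: (n, n) symmetric similarity matrix; labels: 0/1 values for labeled_idx.
        Assumes the similarity graph is connected so the linear system is solvable."""
        n = W.shape[0]
        labeled_idx = np.asarray(labeled_idx)
        unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
        L = np.diag(W.sum(axis=1)) - W                    # graph Laplacian
        L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
        W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
        f = np.zeros(n)
        f[labeled_idx] = labels
        f[unlabeled_idx] = np.linalg.solve(L_uu, W_ul @ f[labeled_idx])
        return f                                          # threshold at 0.5 for binary labels
    ```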
  • Guanhua Yan
    ABSTRACT: Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require that all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) software classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the number of data objects chosen to be clustered.
    2014 IEEE 30th International Conference on Data Engineering (ICDE); 03/2014
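    A hedged sketch of the selection idea named in the abstract: greedily build an independent set in a 3-uniform hypergraph whose hyperedges are conflict triples among experts' clusterings. How conflict triples are constructed, and the paper's actual algorithm and guarantees, are not reproduced here.
    ```python
    def greedy_hypergraph_independent_set(objects, conflict_triples):
        """objects: iterable of ids; conflict_triples: list of frozensets of size 3."""
        degree = {o: 0 for o in objects}
        for triple in conflict_triples:
            for o in triple:
                degree[o] += 1
        selected = set()
        # Visit low-conflict objects first so they are less likely to be blocked later.
        for obj in sorted(objects, key=lambda o: degree[o]):
            candidate = selected | {obj}
            if any(triple <= candidate for triple in conflict_triples):
                continue                      # adding obj would complete a conflict triple
            selected = candidate
        return selected
    ```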
  • Fengyuan Xu · Xiaojun Zhu · Chiu C. Tan · Qun Li · Guanhua Yan · Jie Wu
    ABSTRACT: As the first step of the communication procedure in 802.11, an unwise selection of the access point (AP) hurts a client's throughput. This performance downgrade is usually hard to offset with other methods, such as efficient rate adaptation. In this paper, we study this AP selection problem in a decentralized manner, with the objective of maximizing the minimum throughput among all clients. We reveal through theoretical analysis that the selfish strategy, which is commonly applied in decentralized systems, cannot effectively achieve this objective. Accordingly, we propose an online AP association strategy that not only achieves a minimum throughput (among all clients) that is provably close to the optimum, but also works effectively in practice with reasonable computation and transmission overhead. The association protocol applying this strategy is implemented on commercial hardware and is compatible with legacy APs without any modification. We demonstrate its feasibility and performance through real experiments and extensive simulations.
    IEEE Transactions on Parallel and Distributed Systems 12/2013; 24(12):2482-2491. DOI:10.1109/TPDS.2013.10 · 2.17 Impact Factor
  • Milan Bradonjić · Michael Molloy · Guanhua Yan
    ABSTRACT: Viral spread on large graphs has many real-life applications, such as malware propagation in computer networks and rumor (or misinformation) spread in Twitter-like online social networks. Although viral spread on large graphs has been intensively analyzed on classical models such as Susceptible-Infectious-Recovered, there still exists a deficit of effective methods in practice to contain epidemic spread once it passes a critical threshold. Against this backdrop, we explore methods of containing viral spread in large networks, with a focus on sparse random networks. The viral containment strategy is to partition a large network into small components and then to ensure the sanity of all messages delivered across different components. With such a defense mechanism in place, an epidemic spread starting from any node is limited to only those nodes belonging to the same component as the initial infection node. We establish both lower and upper bounds on the costs of inspecting inter-component messages. We further propose heuristic-based approaches to partition large input graphs into small components. Finally, we study the performance of our proposed algorithms under different network topologies and different edge weight models.
    Internet Mathematics 10/2013; 9(4). DOI:10.1080/15427951.2013.798600
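    An illustrative sketch of the containment strategy described above, using a simple Girvan-Newman-style heuristic (repeatedly cut the highest-betweenness edge) as a stand-in for the paper's partitioning heuristics; the size cap and the heuristic itself are assumptions for illustration.
    ```python
    import networkx as nx

    def partition_into_small_components(g, max_size=100):
        """Cut edges until every connected component has at most max_size nodes.
        Assumes an undirected graph; messages crossing the returned cut edges
        would then be inspected by the defense mechanism."""
        g = g.copy()
        cut_edges = []
        while True:
            big = [c for c in nx.connected_components(g) if len(c) > max_size]
            if not big:
                return [set(c) for c in nx.connected_components(g)], cut_edges
            sub = g.subgraph(big[0])
            betweenness = nx.edge_betweenness_centrality(sub)
            edge = max(betweenness, key=betweenness.get)
            g.remove_edge(*edge)
            cut_edges.append(edge)
    ```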
  • Jinxue Zhang · Rui Zhang · Yanchao Zhang · Guanhua Yan
    ABSTRACT: Online social networks (OSNs) are increasingly threatened by social bots which are software-controlled OSN accounts that mimic human users with malicious intentions. A social botnet refers to a group of social bots under the control of a single botmaster, which collaborate to conduct malicious behavior, while at the same time mimicking the interactions among normal OSN users to reduce their individual risk of being detected. We demonstrate the effectiveness and advantages of exploiting a social botnet for spam distribution and digital-influence manipulation through real experiments on Twitter and also trace-driven simulations. Our results can help understand the potentially detrimental effects of social botnets and help OSNs improve their bot(net) detection systems.
    2013 IEEE Conference on Communications and Network Security (CNS); 10/2013
  • ABSTRACT: Proximity-based mobile social networking (PMSN) refers to the social interaction among physically proximate mobile users. The first step toward effective PMSN is for mobile users to choose whom to interact with. Profile matching refers to two users comparing their personal profiles and is promising for user selection in PMSN. It, however, conflicts with users' growing privacy concerns about disclosing their personal profiles to complete strangers. This paper tackles this open challenge by designing novel fine-grained private matching protocols. Our protocols enable two users to perform profile matching without disclosing any information about their profiles beyond the comparison result. In contrast to existing coarse-grained private matching schemes for PMSN, our protocols allow finer differentiation between PMSN users and can support a wide range of matching metrics at different privacy levels. The performance of our protocols is thoroughly analyzed and evaluated via real smartphone experiments.
    IEEE Journal on Selected Areas in Communications 09/2013; 31(9):656-668. DOI:10.1109/JSAC.2013.SUP.0513057 · 4.14 Impact Factor
  • Deguang Kong · Guanhua Yan
    ABSTRACT: The voluminous malware variants that appear on the Internet have posed severe threats to its security. In this work, we explore techniques that can automatically classify malware variants into their corresponding families. We present a generic framework that extracts structural information from malware programs as attributed function call graphs, in which rich malware features are encoded as attributes at the function level. Our framework further learns discriminant malware distance metrics that evaluate the similarity between the attributed function call graphs of two malware programs. To combine various types of malware attributes, our method adaptively learns the confidence level associated with the classification capability of each attribute type and then adopts an ensemble of classifiers for automated malware classification. We evaluate our approach with a number of Windows-based malware instances belonging to 11 families, and experimental results show that our automated malware classification method is able to achieve high classification accuracy.
    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
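    A hedged sketch of the ensemble idea from the abstract: train one classifier per attribute type, estimate each type's confidence by cross-validation, and combine predictions by weighted voting. The choice of base classifier and the omission of the learned distance metrics are simplifications for illustration, not the paper's method.
    ```python
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def train_weighted_ensemble(feature_blocks, y, cv=5):
        """feature_blocks: one (n_samples, d_i) matrix per attribute type."""
        models, weights = [], []
        for X in feature_blocks:
            clf = KNeighborsClassifier(n_neighbors=3)
            weights.append(cross_val_score(clf, X, y, cv=cv).mean())  # confidence of this view
            models.append(clf.fit(X, y))
        return models, np.asarray(weights)

    def predict_weighted_vote(models, weights, feature_blocks, classes):
        """Combine per-attribute-type predictions by confidence-weighted voting."""
        votes = np.zeros((feature_blocks[0].shape[0], len(classes)))
        for clf, w, X in zip(models, weights, feature_blocks):
            for i, label in enumerate(clf.predict(X)):
                votes[i, classes.index(label)] += w
        return [classes[j] for j in votes.argmax(axis=1)]
    ```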
  • Guanhua Yan · Nathan Brown · Deguang Kong
    ABSTRACT: The ever-growing malware threat in the cyber space calls for techniques that are more effective than widely deployed signature-based detection systems and more scalable than manual reverse engineering by forensic experts. To counter large volumes of malware variants, machine learning techniques have been applied recently for automated malware classification. Despite the successes made from these efforts, we still lack a basic understanding of some key issues, such as what features we should use and which classifiers perform well on malware data. Against this backdrop, the goal of this work is to explore discriminatory features for automated malware classification. We conduct a systematic study on the discriminative power of various types of features extracted from malware programs, and experiment with different combinations of feature selection algorithms and classifiers. Our results not only offer insights into what features most distinguish malware families, but also shed light on how to develop scalable techniques for automated malware classification in practice.
    Proceedings of the 10th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment; 07/2013
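    A minimal sketch of the kind of feature-selection-plus-classifier combination such a study compares; the specific selector, classifier, and parameter values below are assumptions for illustration only.
    ```python
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    def evaluate_feature_classifier_combo(X, y, k_features=200):
        """X: non-negative feature matrix (e.g., n-gram counts); k_features <= X.shape[1]."""
        pipeline = Pipeline([
            ("select", SelectKBest(chi2, k=k_features)),   # keep the most discriminative features
            ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ])
        return cross_val_score(pipeline, X, y, cv=5).mean()
    ```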
  • Nam P. Nguyen · Guanhua Yan · My T. Thai
    ABSTRACT: With their blistering expansion in recent years, popular online social sites such as Twitter, Facebook and Bebo have become not only one of the most effective channels for viral marketing but also the major news sources for many people nowadays. Alongside these promising features, however, comes the threat of misinformation propagation which can lead to undesirable effects. Due to the sheer number of online social network (OSN) users and the highly clustered structures commonly shared by these kinds of networks, there is a substantial challenge to efficiently contain viral spread of misinformation in large-scale social networks. In this paper, we focus on how to limit viral propagation of misinformation in OSNs. Particularly, we study a set of problems, namely the βTI-Node Protectors problems, which aim to find the smallest set of highly influential nodes from which disseminating good information helps to contain the viral spread of misinformation, initiated from a set of nodes I, within a desired fraction (1-β) of the nodes in the entire network in T time steps. For this family of problems, we analyze and present solutions including their inapproximability results, greedy algorithms that provide better lower bounds on the number of selected nodes, and a community-based method for solving these problems. We further conduct a number of experiments on real-world traces, and the empirical results show that our proposed methods outperform existing alternative approaches in finding those important nodes that can help to contain the spread of misinformation effectively.
    Computer Networks 07/2013; 57(10):2133–2146. DOI:10.1016/j.comnet.2013.04.002 · 1.28 Impact Factor
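    A hedged sketch of the greedy selection idea for this family of problems: repeatedly add the node whose inclusion most increases the number of protected nodes until a target fraction is reached. The protected() oracle stands in for the paper's influence/decontamination model and is an assumption.
    ```python
    def greedy_node_protectors(nodes, protected, target_count):
        """protected(seeds) -> set of nodes protected when good information spreads from seeds."""
        seeds = set()
        while len(protected(seeds)) < target_count:
            covered = protected(seeds)
            best, best_gain = None, 0
            for v in nodes:
                if v in seeds:
                    continue
                gain = len(protected(seeds | {v})) - len(covered)
                if gain > best_gain:
                    best, best_gain = v, gain
            if best is None:            # no remaining node adds coverage
                break
            seeds.add(best)
        return seeds
    ```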
  • Deguang Kong · Guanhua Yan
    ABSTRACT: In this work, we explore techniques that can automatically classify malware variants into their corresponding families. Our framework extracts structural information from malware programs as attributed function call graphs, then learns discriminant malware distance metrics, and finally adopts an ensemble of classifiers for automated malware classification. Experimental results show that our method is able to achieve high classification accuracy.
    Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems; 06/2013
  • ABSTRACT: Traditional software and security patch update delivery mechanisms rely on a client/server approach where clients pull updates from servers regularly. This approach, however, suffers from a high window of vulnerability (WOV) for clients and the risk of a single point of failure. Overlay-based information dissemination schemes overcome these problems, but often incur high infrastructure cost to set up and maintain individual information dissemination networks. Against this backdrop, we propose iDispatcher, a planet-scale, flexible and secure information dissemination platform. iDispatcher uses a hybrid approach with both push- and pull-based information dissemination to reduce the WOV period and achieve high distribution coverage. iDispatcher also uses a peer-to-peer based architecture to achieve higher scalability. We develop a self-contained key management mechanism for iDispatcher. Our prototype for iDispatcher is deployed on more than 500 PlanetLab nodes distributed around the world. Experimental results show that iDispatcher can have small dissemination latency for time-critical applications, is highly tunable to optimize the tradeoff between bandwidth and latency, and works resiliently against different attacks such as flooding attacks.
    Peer-to-Peer Networking and Applications 03/2013; 6(1). DOI:10.1007/s12083-012-0128-8 · 0.46 Impact Factor
  • Guanhua Yan
    ABSTRACT: In order to evade detection by ever-improving defense techniques, modern botnet masters are constantly looking for new communication platforms for delivering C&C (Command and Control) information. Attracting their attention is the emergence of online social networks such as Twitter, as the information dissemination mechanism provided by these networks can naturally be exploited for spreading botnet C&C information, and the enormous amount of normal communications co-existing in these networks makes it a daunting task to tease out botnet C&C messages. Against this backdrop, we explore graph-theoretic techniques that aid effective monitoring of potential botnet activities in large open online social networks. Our work is based on extensive analysis of a Twitter dataset that contains more than 40 million users and 1.4 billion following relationships, and we mine patterns from the Twitter network structure that can be leveraged to improve the efficiency of botnet monitoring. Our analysis reveals that the static Twitter topology contains a small core subgraph; after removing it, the Twitter network breaks down into small connected components, each of which can be handily monitored for potential botnet activities. Based on this observation, we propose a method called Peri-Watchdog, which computes the core of a large online social network and derives the set of nodes that are likely to pass botnet C&C information in the periphery of the online social network. We analyze the time complexity of Peri-Watchdog under its normal operations. We further apply Peri-Watchdog to the Twitter graph injected with synthetic botnet structures and investigate its effectiveness in detecting potential C&C information from these botnets. To verify whether patterns observed from the static Twitter graph are common to other online social networks, we analyze another online social network dataset, BrightKite, which captures the evolution of the social graphs formed by its users over half a year. We show not only that there exists a similarly small core in the BrightKite network, but also that this core remains stable over the course of BrightKite's evolution. We also find that to accommodate the dynamic growth of BrightKite, the core has to be updated about every 18 days under a constrained monitoring capacity.
    Computer Networks 02/2013; 57(2):540–555. DOI:10.1016/j.comnet.2012.07.016 · 1.28 Impact Factor
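    An illustrative sketch of the core-and-periphery monitoring idea described above (not Peri-Watchdog itself): treat the highest-degree accounts as the core, remove them, and monitor each remaining connected component separately. The degree-based core, the core size, and the undirected-graph assumption are simplifications for illustration.
    ```python
    import networkx as nx

    def periphery_components(g, core_size=1000):
        """Remove the core_size highest-degree nodes and return the remaining components."""
        by_degree = sorted(g.degree, key=lambda pair: pair[1], reverse=True)
        core = {node for node, _ in by_degree[:core_size]}
        periphery = g.subgraph(set(g.nodes) - core)
        return core, [set(c) for c in nx.connected_components(periphery)]
    ```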
  • Guanhua Yan · Ritchie Lee · Alex Kent · David Wolpert
    ABSTRACT: With a long history of compromising Internet security, Distributed Denial-of-Service (DDoS) attacks have been intensively investigated and numerous countermeasures have been proposed to defend against them. In this work, we propose a non-standard game-theoretic framework that facilitates evaluation of DDoS attacks and defense. Our framework can be used to study diverse DDoS attack scenarios where multiple layers of protection are deployed and a number of uncertain factors affect the decision making of the players, and it also allows us to model different sophistication levels of reasoning by both the attacker and the defender. We conduct a variety of experiments to evaluate DDoS attack and defense scenarios where one or more layers of defense mechanisms are deployed, and demonstrate that our framework sheds light on the interplay between the decision making of the attacker and the defender, as well as how their decisions affect the outcomes of DDoS attack and defense games.
    Proceedings of the 2012 ACM conference on Computer and communications security; 10/2012
  • ABSTRACT: Recently, tuning the clear channel assessment (CCA) threshold in conjunction with power control has been considered for improving the performance of WLANs. However, we show that CCA tuning can be exploited by selfish nodes to obtain an unfair share of the available bandwidth. Specifically, a selfish entity can manipulate the CCA threshold to ignore ongoing transmissions; this increases its probability of accessing the medium and provides the entity with a higher, unfair share of the bandwidth. We experiment on our 802.11 testbed to characterize the effects of CCA tuning on both isolated links and 802.11 WLAN configurations. We focus on AP-client(s) configurations, proposing a novel approach to detect this misbehavior. A misbehaving client is unlikely to recognize low-power receptions as legitimate packets; by intelligently sending low-power probe messages, an AP can efficiently detect a misbehaving node. Our key contributions are: 1) We are the first to quantify the impact of selfish CCA tuning via extensive experimentation on various 802.11 configurations. 2) We propose a lightweight scheme for detecting selfish nodes that inappropriately increase their CCA thresholds. 3) We extensively evaluate our system on our testbed; its accuracy is 95 percent while the false positive rate is less than 5 percent.
    IEEE Transactions on Mobile Computing 07/2012; 11(7):1086-1101. DOI:10.1109/TMC.2011.131 · 2.91 Impact Factor
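    A minimal sketch of the probe-based decision rule implied by the abstract above: a compliant client should acknowledge a reasonable fraction of low-power probes, while a client that has raised its CCA threshold will miss most of them. The acknowledgment-ratio threshold is an assumption, not the paper's value.
    ```python
    def is_cca_misbehaving(acked_probes, total_probes, min_ack_ratio=0.2):
        """Flag a client that acknowledges too few low-power probes (illustrative threshold)."""
        if total_probes == 0:
            return False
        return acked_probes / total_probes < min_ack_ratio
    ```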
  • ABSTRACT: With their blistering expansion in recent years, popular online social sites such as Twitter, Facebook and Bebo have become some of the major news sources as well as the most effective channels for viral marketing nowadays. However, alongside these promising features comes the threat of misinformation propagation, which can lead to undesirable effects, such as the widespread panic in the general public due to faulty swine flu tweets on Twitter in 2009. Due to the huge number of online social network (OSN) users and the highly clustered structures commonly observed in these kinds of networks, it is a substantial challenge to efficiently contain viral spread of misinformation in large-scale social networks. In this paper, we focus on how to limit viral propagation of misinformation in OSNs. Particularly, we study a set of problems, namely the β1T-Node Protectors problems, which aim to find the smallest set of highly influential nodes whose decontamination with good information helps to contain the viral spread of misinformation, initiated from the set I, to a desired ratio (1 − β) in T time steps. For this family of problems, we analyze and present solutions including an inapproximability result, greedy algorithms that provide better lower bounds on the number of selected nodes, and a community-based heuristic method for the Node Protector problems. To verify our suggested solutions, we conduct experiments on real-world traces including the NetHEPT, NetHEPT_WC and Facebook networks. Empirical results indicate that our methods are among the best for identifying those important nodes, in comparison with other available methods.
    Proceedings of the 3rd Annual ACM Web Science Conference; 06/2012
  • Rui Zhang · Yanchao Zhang · Stella Sun · Guanhua Yan
    ABSTRACT: Proximity-based mobile social networking (PMSN) refers to the social interaction among physically proximate mobile users directly through the Bluetooth/WiFi interfaces on their smartphones or other mobile devices. It is becoming increasingly popular due to the recent explosive growth of smartphone users. Profile matching means two users comparing their personal profiles and is often the first step towards effective PMSN. It, however, conflicts with users' growing privacy concerns about disclosing their personal profiles to complete strangers before deciding to interact with them. This paper tackles this open challenge by designing a suite of novel fine-grained private matching protocols. Our protocols enable two users to perform profile matching without disclosing any information about their profiles beyond the comparison result. In contrast to existing coarse-grained private matching schemes for PMSN, our protocols allow finer differentiation between PMSN users and can support a wide range of matching metrics at different privacy levels. The security and communication/computation overhead of our protocols are thoroughly analyzed and evaluated via detailed simulations.
    Proceedings - IEEE INFOCOM 01/2012; DOI:10.1109/INFCOM.2012.6195574
  • ABSTRACT: We study the parameters (knobs) of distribution-based anomaly detection methods, and how their tuning affects the quality of detection. Specifically, we analyze the popular entropy-based anomaly detection in detecting covert channels in Voice over IP (VoIP) traffic. There has been little effort in prior research to rigorously analyze how the knobs of an anomaly detection methodology should be tuned. Such analysis is, however, critical before such methods can be deployed by a practitioner. We develop a probabilistic model to explain the effects of the tuning of the knobs on the rate of false positives and false negatives. We then study the observations produced by our model analytically as well as empirically. We examine the knobs of window length and detection threshold. Our results show how the knobs should be set to achieve a high rate of detection while maintaining a low rate of false positives. We also show how the throughput of the covert channel (the magnitude of the anomaly) affects the rate of detection, thereby allowing a practitioner to be aware of the capabilities of the methodology.
    01/2012; DOI:10.1109/HICSS.2012.456
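    A minimal sketch of a windowed entropy detector with the two knobs the paper analyzes (window length and detection threshold); the baseline construction and parameter values here are assumptions for illustration.
    ```python
    import math
    from collections import Counter

    def window_entropy(window):
        counts = Counter(window)
        total = len(window)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def detect_anomalies(stream, window_len=500, threshold=0.5):
        """Flag windows whose entropy deviates from the first window's by more than threshold."""
        alarms, baseline = [], None
        for start in range(0, len(stream) - window_len + 1, window_len):
            h = window_entropy(stream[start:start + window_len])
            if baseline is None:
                baseline = h                     # first window sets the baseline
            elif abs(h - baseline) > threshold:
                alarms.append(start)
        return alarms
    ```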
  • Chrisil Arackaparambil · Guanhua Yan
    ABSTRACT: Wikipedia has become a standard source of reference online, and many people (some unknowingly) now trust this corpus of knowledge as an authority to fulfil their information requirements. In doing so they task the human contributors of Wikipedia with maintaining the accuracy of articles, a job that these contributors have been performing admirably. We study the problem of monitoring the Wikipedia corpus with the goal of automated, online anomaly detection. We present Wiki-watchdog, an efficient distribution-based methodology that monitors distributions of revision activity for changes. We show that using our methods it is possible to detect the activity of bots, flash events, and outages as they occur. Our methods are proposed to support the monitoring of the contributors. They are useful to speed up anomaly detection and to identify events that are hard to detect manually. We show the efficacy and the low false-positive rate of our methods by experiments on the revision history of Wikipedia. Our results show that distribution-based anomaly detection has a higher detection rate than traditional methods based on either volume or entropy alone. Unlike previous work on anomaly detection in information networks that worked with a static network graph, our methods consider the network as it evolves and monitor its properties for changes. Although our methodology is developed and evaluated on Wikipedia, we believe it is an effective generic anomaly detection framework in its own right.
    Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Campus Scientifique de la Doua, Lyon, France, August 22-27, 2011; 08/2011
  • Guanhua Yan · Duc T. Ha · Stephan Eidenbenz
    ABSTRACT: Botnets have emerged as one of the most severe cyber-threats in recent years. To evade detection and improve resistance against countermeasures, botnets have evolved from the first generation that relies on IRC chat channels to deliver commands to the current generation that uses highly resilient P2P (peer-to-peer) protocols to spread their C&C (Command and Control) information. On an encouraging note, the seminal work done by Holz et al. [14] showed that P2P botnets, although relieved from the single point of failure that IRC botnets suffer, can be easily disrupted using pollution-based mitigation schemes. For white-hat cyber-security practitioners to be better prepared for potentially destructive P2P botnets, it is necessary for them to understand the strategy space from the attacker's perspective. Against this backdrop, we analyze a new type of P2P botnet, which we call AntBot, that aims to spread its C&C information to individual bots even though an adversary persistently pollutes keys used by seized bots to search for the C&C information. The tree-like structure of AntBot, together with the randomness and redundancy in its design, renders it possible that individual bots, when captured, reveal only limited information. We mathematically analyze the performance of AntBot from the perspectives of reachability, resilience to pollution, and scalability. To evaluate the effectiveness of AntBot against pollution-based mitigation in a practical setting, we develop a distributed high-fidelity P2P botnet simulator that uses the actual implementation code of aMule, a popular Kademlia-based P2P client. The simulator offers us a tool to evaluate the attacker's strategy in the cyber space without causing ethical or legal issues, which may result from real-world deployment. Using extensive simulation, we demonstrate that AntBot operates resiliently against pollution-based mitigation. We further suggest a few potential defense schemes that could effectively disrupt AntBot operations and also present challenges that researchers need to address when developing these techniques in practice.
    Computer Networks 06/2011; 55:1941-1956. DOI:10.1016/j.comnet.2011.02.006 · 1.28 Impact Factor

Publication Stats

594 Citations
20.97 Total Impact Points

Institutions

  • 2006–2014
    • Los Alamos National Laboratory
      • Computer, Computational, and Statistical Sciences Division
      Los Alamos, New Mexico, United States
  • 2012
    • The University of Tennessee Medical Center at Knoxville
      Knoxville, Tennessee, United States
  • 2004–2006
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      Urbana, Illinois, United States
  • 2003
    • Dartmouth College
      • Department of Computer Science
      Hanover, New Hampshire, United States