Blake Anderson’s research while affiliated with Los Alamos National Laboratory and other places


Publications (13)


Two programs. Boxes represent subroutines of a program, and arrows depict one subroutine calling another. The shaded region of subroutines for Program A and Program B represents a shared set of subroutines. Program A was correctly classified as malicious by an SVM classifier based on overall 2-gram opcode similarity, while Program B was incorrectly classified as benign, despite Program A being in the APT malware training sample. The subroutines in the shaded region were not matched in a benign set of 4500+ programs.
Cumulative subroutine match. The plot depicts three different APT 1 programs, giving the fraction of subroutines matched by at least one benign program as a function of the number of benign programs examined
Cutoff for $R_1^{(\alpha)}$. The figure plots the false positive rate as a function of the threshold used for $R_1^{(0.99)}$ to classify a program as benign or malicious. If $R_1^{(0.99)}/n_p \ge$ 'threshold', then the program is classified as malicious.
ROC curves. The plot shows the ROC curves for four different classifiers in the out-of-family comparison. The dashed curve corresponds to $R_1^{(.99)}$, the dotted curve to $R_{1|M}^{(.99)}$, the dashed-dotted curve to the SVM classifier using the Markov chain opcode representation, and the solid curve to the classifier combining $R_{1|M}^{(.99)}$ with the SVM classifier.
Subroutine based detection of APT malware
Article · November 2016 · 222 Reads · 31 Citations · Journal of Computer Virology and Hacking Techniques

Curtis Storlie · Blake Anderson

Statistical detection of mass malware has been shown to be highly successful. However, this type of malware is less interesting to cyber security officers of larger organizations, who are more concerned with detecting malware indicative of a targeted attack. Here we investigate the potential of statistically based approaches to detect such malware using a malware family associated with a large number of targeted network intrusions. Our approach is complementary to the bulk of statistically based malware classifiers, which are typically based on measures of overall similarity between executable files. One problem with the overall-similarity approach is that a malicious executable that shares some, but limited, functionality with known malware is likely to be misclassified as benign. Here a new approach to malware classification is introduced that classifies programs based on their similarity with known malware subroutines. It is illustrated that malware and benign programs can share a substantial amount of code, implying that classification should be based on malicious subroutines that occur infrequently, or not at all, in benign programs. Various approaches to accomplishing this task are investigated, and a particularly simple approach appears the most effective. This approach simply computes the fraction of subroutines of a program that are similar to malware subroutines whose likes have not been found in a larger benign set. If this fraction exceeds around 1.5%, the corresponding program can be classified as malicious at a 1 in 1000 false alarm rate. It is further shown that combining a local and an overall similarity based approach can lead to considerably better prediction due to the relatively low correlation of their predictions.
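
The thresholding rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper matches subroutines by similarity, whereas this sketch assumes each subroutine has already been reduced to a hypothetical signature string and uses exact set membership as a stand-in. The names `flagged_fraction`, `classify`, and `malware_only_signatures` are illustrative assumptions; only the roughly 1.5% cutoff comes from the abstract.

```python
from typing import Iterable, Set


def flagged_fraction(program_subroutines: Iterable[str],
                     malware_only_signatures: Set[str]) -> float:
    """Fraction of a program's subroutines that match a malware-only signature.

    ASSUMPTION: subroutines are represented by hashable signature strings and
    "similar" is approximated by exact membership; the paper uses a similarity
    measure instead.
    """
    subroutines = list(program_subroutines)
    if not subroutines:
        return 0.0
    hits = sum(1 for s in subroutines if s in malware_only_signatures)
    return hits / len(subroutines)


def classify(program_subroutines: Iterable[str],
             malware_only_signatures: Set[str],
             threshold: float = 0.015) -> str:
    """Label a program malicious when the flagged fraction reaches the threshold.

    The default of 0.015 mirrors the "around 1.5%" figure quoted in the abstract.
    """
    fraction = flagged_fraction(program_subroutines, malware_only_signatures)
    return "malicious" if fraction >= threshold else "benign"
```

In this toy setting, a program with 100 subroutines of which 2 match malware-only signatures has a flagged fraction of 0.02 and would be labeled malicious, while one with a single match (0.01) would be labeled benign.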


A Study of Usability-aware Network Trace Anonymization

July 2015 · 2,031 Reads · 15 Citations

The publication and sharing of network trace data is critical to the advancement of collaborative research among various entities in government, the private sector, and academia. However, due to the sensitive and confidential nature of the data involved, entities have to employ various anonymization techniques to meet legal requirements and comply with confidentiality policies. The very composition of network trace data makes applying anonymization techniques a challenge, and basic application of microdata anonymization techniques to network traces is problematic and does not deliver the necessary data usability. Therefore, as a contribution, we point out some of the ongoing challenges in network trace anonymization. We then suggest usability-aware anonymization heuristics that employ microdata privacy techniques while giving consideration to the usability of the anonymized data. Our preliminary results show that, with trade-offs, it might be possible to generate anonymized network traces with enhanced usability, on a case-by-case basis, using microdata anonymization techniques.


Integrating multiple data sources for malware classification

April 2015 · 11 Reads

Disclosed herein are representative embodiments of tools and techniques for classifying programs. According to one exemplary technique, at least one graph representation of at least one dynamic data source of at least one program is generated. Also, at least one graph representation of at least one static data source of the at least one program is generated. Additionally, at least using the at least one graph representation of the at least one dynamic data source and the at least one graph representation of the at least one static data source, the at least one program is classified.


Malware Detection Using Nonparametric Bayesian Clustering and Classification Techniques

November 2014 · 58 Reads · 8 Citations · Technometrics

Computer security requires statistical methods to quickly and accurately flag malicious programs. This article proposes a nonparametric Bayesian approach for classifying programs as benign or malicious and simultaneously clustering malicious programs. The analysis is based on the dynamic trace of instructions under the first-order Markov assumption. Each row of the trace's transition matrix is modeled using the Dirichlet process mixture (DPM) model. The DPM model clusters programs within each class (malicious or benign), and produces the posterior probability that a program is malware, which is used for classification. The novelty of the model lies in using this clustering algorithm to improve the classification accuracy. The simulation study shows that the DPM model outperforms the elastic net logistic (ENL) regression and the support vector machine (SVM) in classification performance under most of the scenarios, and also outperforms the spectral clustering method for grouping similar malware. In an analysis of real malicious and benign programs, the DPM model gives significantly better classification performance than the ENL model, and results competitive with the SVM. More importantly, the DPM model identifies clusters of programs during the classification procedure, which is useful for reverse engineering.


Automating Reverse Engineering with Machine Learning Techniques

November 2014 · 2,775 Reads · 7 Citations

Malware continues to be an ongoing threat, with millions of unique variants created every year. Unlike the majority of this malware, Advanced Persistent Threat (APT) malware is created to target a specific network or set of networks and has a precise objective, e.g., exfiltrating sensitive data. While 0-day malware detectors are a good start, they do not help reverse engineers better understand the threats attacking their networks. Understanding the behavior of malware is often a time-sensitive task, and can take anywhere from several hours to several weeks. Our goal is to automate the task of identifying the general function of the subroutines in the function call graph of the program to aid reverse engineers. Two approaches to modeling the subroutine labels are investigated: a multiclass Gaussian process and a multiclass support vector machine. The output of these methods is the probability that the subroutine belongs to a certain class of functionality (e.g., file I/O, exploit, etc.). Promising initial results, illustrating the efficacy of this method, are presented on a sample of 201 subroutines taken from two malicious families.


Malware Phylogenetics Based on the Multiview Graphical Lasso

October 2014 · 32 Reads · 11 Citations · Lecture Notes in Computer Science

Malware phylogenetics has gained a lot of traction over the past several years. More recently, researchers have begun looking at directed acyclic graphs (DAGs) to model the evolutionary relationships between samples of malware. Phylogenetic graphs offer analysts a better understanding of how malware has evolved by clearly illustrating the lineage of a given family. In this paper, we present a novel algorithm based on graphical lasso. We extend graphical lasso to incorporate multiple views, both static and dynamic, of malware. For each program family, a convex combination of the views is found such that the objective function of graphical lasso is maximized. Learning the weights of each view on a per-family basis, as opposed to treating all views as an extended feature vector, is essential in the malware domain because different families employ different obfuscation strategies, which limit the information available in different views. We demonstrate results on three malicious families and two benign families where the ground truth is known.
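
For orientation, the idea of maximizing the graphical lasso objective over a convex combination of views can be written out as follows. This is a sketch under stated assumptions rather than the paper's formulation: it uses the standard graphical lasso objective, and the symbols $S_v$ (an empirical covariance or similarity matrix for view $v$), $\beta_v$ (view weights), $\Theta$ (the sparse precision matrix), and $\lambda$ (the sparsity penalty) are notation introduced here for illustration.

```latex
% Standard graphical lasso objective for a single covariance matrix S:
%   maximize over positive definite \Theta
\max_{\Theta \succ 0} \; \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \lambda \lVert \Theta \rVert_1

% Assumed multiview extension: replace S by a convex combination of per-view matrices
S(\beta) \;=\; \sum_{v=1}^{V} \beta_v S_v,
\qquad \beta_v \ge 0, \qquad \sum_{v=1}^{V} \beta_v = 1

% Joint problem, solved separately for each program family
\max_{\Theta \succ 0,\; \beta} \; \log\det\Theta \;-\; \operatorname{tr}\!\bigl(S(\beta)\,\Theta\bigr) \;-\; \lambda \lVert \Theta \rVert_1
```

Under this reading, a family whose static view is heavily obfuscated would be expected to receive a small weight $\beta_v$ for that view, which is the per-family behavior the abstract argues for.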


Fig. 1. Markov chain transition probability representation of a dynamic instruction trace: (left) the first several lines from a dynamic trace output (i.e., instruction and location acted on in memory, which is not used), (right) a conceptual conversion of the instruction sequence into categorization 1 transition probabilities.  
Fig. 2. ROC curves for the methods in Table 2.
Fig. 4. Computation time breakdown for analysis of a particular new program that generated ∼3.5 × 10^6 instructions in a 5 minute trace. CIs calculated from a sample of 1000 posterior draws of P.
Stochastic identification of Malware with dynamic traces

April 2014 · 347 Reads · 25 Citations · The Annals of Applied Statistics

Curtis Storlie · Blake Anderson · [...] · Nathan Brown

A novel approach to malware classification is introduced based on analysis of instruction traces that are collected dynamically from the program in question. The method has been implemented online in a sandbox environment (i.e., a security mechanism for separating running programs) at Los Alamos National Laboratory, and is intended for eventual host-based use, provided the issue of sampling the instructions executed by a given process without disruption to the user can be satisfactorily addressed. The procedure represents an instruction trace with a Markov chain structure in which the transition matrix, $\mathbf{P}$, has rows modeled as Dirichlet vectors. The malware class (malicious or benign) is modeled using a flexible spline logistic regression model with variable selection on the elements of $\mathbf{P}$, which are observed with error. The utility of the method is illustrated on a sample of traces from malware and nonmalware programs, and the results are compared to other leading detection schemes (both signature and classification based). This article also has supplementary materials available online.
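
To make the "rows modeled as Dirichlet vectors" idea concrete, here is a minimal Python sketch of the conjugate Dirichlet-multinomial update for each row of a transition matrix estimated from an instruction trace. It covers only that one ingredient: the spline logistic regression and variable selection described in the abstract are not shown. The symmetric prior with concentration `alpha`, the toy opcode names, and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def transition_counts(trace: Sequence[str]) -> Counter:
    """Count consecutive (source, destination) instruction pairs in a trace."""
    return Counter(zip(trace, trace[1:]))


def posterior_mean_rows(trace: Sequence[str],
                        states: Sequence[str],
                        alpha: float = 1.0) -> Dict[str, Dict[str, float]]:
    """Posterior mean of each transition-matrix row under a Dirichlet prior.

    ASSUMPTION: every row has a symmetric Dirichlet(alpha, ..., alpha) prior,
    so the posterior mean of entry (i, j) is
    (count_ij + alpha) / (row_total_i + alpha * K), with K = len(states).
    """
    counts = transition_counts(trace)
    k = len(states)
    rows: Dict[str, Dict[str, float]] = {}
    for src in states:
        row_total = sum(counts[(src, dst)] for dst in states)
        rows[src] = {
            dst: (counts[(src, dst)] + alpha) / (row_total + alpha * k)
            for dst in states
        }
    return rows
```

For example, `posterior_mean_rows(["mov", "add", "mov", "add", "jmp"], ["mov", "add", "jmp"])` smooths the observed counts so that transitions never seen in the trace still receive nonzero probability, and every row sums to 1. A state with no observed outgoing transitions falls back to the uniform prior mean.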


Multiple Kernel Learning Clustering with an Application to Malware

December 2012 · 47 Reads · 13 Citations

With the increasing prevalence of richer, more complex data sources, learning with multiple views is becoming more widespread. Multiple kernel learning (MKL) has been developed to address this problem, but in general, the solutions provided by traditional MKL are restricted to a classification objective function. In this work, we develop a novel multiple kernel learning algorithm that is based on a spectral clustering objective function which is able to find an optimal kernel weight vector for the clustering problem. We go on to show how this optimization problem can be cast as a semidefinite program and efficiently solved using off-the-shelf interior point methods.


Improving malware classification: Bridging the static/dynamic gap

October 2012 · 293 Reads · 157 Citations

Malware classification systems have typically used some machine learning algorithm in conjunction with either static or dynamic features collected from the binary. Recently, more advanced malware has introduced mechanisms to avoid detection in these views by using obfuscation techniques to avoid static detection and execution-stalling techniques to avoid dynamic detection. In this paper we construct a classification framework that is able to incorporate both static and dynamic views into a unified framework in the hopes that, while a malicious executable can disguise itself in some views, disguising itself in every view while maintaining malicious intent will prove to be substantially more difficult. Our method uses kernels to place a similarity metric on each distinct view and then employs multiple kernel learning to find a weighted combination of the data sources which yields the best classification accuracy in a support vector machine classifier. Our approach opens up new avenues of malware research which will allow the research community to elegantly look at multiple facets of malware simultaneously, and which can easily be extended to integrate any new data sources that may become popular in the future.
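
The "weighted combination of the data sources" can be pictured as a convex combination of per-view kernel matrices. The Python sketch below shows only that combination step, with the weights supplied by hand; in multiple kernel learning the weights are learned jointly with the classifier, which is not attempted here. The function name and the toy matrices are assumptions for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combine_kernels(kernels: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Return the weighted sum of square kernel matrices of equal size.

    ASSUMPTION: weights are nonnegative and sum to 1 (a convex combination),
    and are given rather than learned.
    """
    if not kernels or len(kernels) != len(weights):
        raise ValueError("need one weight per kernel")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    n = len(kernels[0])
    if any(len(k) != n or any(len(row) != n for row in k) for k in kernels):
        raise ValueError("all kernels must be n-by-n matrices of the same size")
    return [
        [sum(w * k[i][j] for k, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]
```

With a hypothetical static-view kernel `[[1.0, 0.2], [0.2, 1.0]]`, a dynamic-view kernel `[[1.0, 0.8], [0.8, 1.0]]`, and weights `[0.25, 0.75]`, the combined similarity between the two programs is 0.25 × 0.2 + 0.75 × 0.8 = 0.65. The combined matrix could then be passed to any classifier that accepts a precomputed kernel.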


Fig. 2 The left table shows an example of the trace data we collect. A hypothetical resulting graph representing a fragment of the Markov chain is shown on the right. In a real Markov chain graph, all of the out-going edges would sum to 1
Fig. 4 Classification accuracy of 50 instances of malware versus 10 instances of benign software as we vary the number of eigenvectors, k, of the spectral kernel. Results are averaged over 10 runs with the error bars being one standard deviation
Fig. 5 The heat maps of the kernel (similarity) matrix for benign software versus malware. The smaller block in the upper left of each figure is the benign software and the larger lower right is the malware. a Gaussian kernel, b Spectral kernel, c Combined kernel
Fig. 6 The heat maps of the kernel matrix for the Netbull virus with different packers versus malware. a Gaussian kernel, b Spectral kernel, c Combined kernel
results for the computation time for each step of our method. All results are in seconds with one standard deviation given
Graph-based malware detection using dynamic analysis

November 2011 · 6,976 Reads · 298 Citations · Journal in Computer Virology

We introduce a novel malware detection algorithm based on the analysis of graphs constructed from dynamically collected instruction traces of the target executable. These graphs represent Markov chains, where the vertices are the instructions and the transition probabilities are estimated by the data contained in the trace. We use a combination of graph kernels to create a similarity matrix between the instruction trace graphs. The resulting graph kernel measures similarity between graphs on both local and global levels. Finally, the similarity matrix is sent to a support vector machine to perform classification. Our method is particularly appealing because we do not base our classifications on the raw n-gram data, but rather use our data representation to perform classification in graph space. We demonstrate the performance of our algorithm on two classification problems: benign software versus malware, and the Netbull virus with different packers versus other classes of viruses. Our results show a statistically significant improvement over signature-based and other machine learning-based detection methods.
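
The pipeline in this abstract (trace, Markov chain graph, kernel, SVM) can be sketched for its first two stages in Python. The code estimates transition probabilities from a trace and compares two resulting matrices with a Gaussian kernel, one of the kernels named in the figure captions. The exact kernel form, the bandwidth `sigma`, the fixed instruction vocabulary, and the function names are assumptions; the spectral kernel and the SVM step are omitted.

```python
import math
from collections import Counter
from typing import List, Sequence

Matrix = List[List[float]]


def transition_matrix(trace: Sequence[str], vocab: Sequence[str]) -> Matrix:
    """Estimate Markov chain transition probabilities from an instruction trace.

    Rows and columns follow the order of `vocab`. Transitions involving
    instructions outside `vocab` are ignored, and a row with no observed
    outgoing transitions is left as all zeros.
    """
    counts = Counter(zip(trace, trace[1:]))
    matrix: Matrix = []
    for src in vocab:
        total = sum(counts[(src, dst)] for dst in vocab)
        matrix.append(
            [counts[(src, dst)] / total if total else 0.0 for dst in vocab]
        )
    return matrix


def gaussian_kernel(a: Matrix, b: Matrix, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||A - B||_F^2 / (2 sigma^2)) between two matrices.

    ASSUMPTION: similarity is computed on the flattened transition matrices.
    """
    squared_distance = sum(
        (x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    )
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Evaluating `gaussian_kernel` for every pair of programs yields the similarity matrix that the abstract describes handing to a support vector machine. Identical traces give a kernel value of 1, and the value decays toward 0 as the transition structure diverges.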


Citations (10)


... Based on our experiment and study, we chose phylogenetic as a basic concept in forming a mobile malware classification. There are a number of existing works related to malware phylogenetic, namely [13][14][15][16][17][18][19][20]. ...

Reference:

Mobile Malware Classification based on Phylogenetics
Malware Phylogenetics Based on the Multiview Graphical Lasso
  • Citing Conference Paper
  • October 2014

Lecture Notes in Computer Science

... Ghiasi, 2015), using two-dimensional binary program features (Berlin, 2015), employing subroutine based detection (J. Sexton, 2016), employing statistics of assembly instructions (P. Khodamoradi Dynamic analysis is used in studies that synthesize the semantics of obfuscated code, multi-hypothesis testing (Perdisci, 2016), quantitative data flow graph metrics (T. ...

Subroutine based detection of APT malware

Journal of Computer Virology and Hacking Techniques

... When a new variant of malware is acquired, experts commonly analyze the sample manually because the knowledge about the functions of malware is necessary for its removal [21]. Such manual analysis requires several hours to several weeks, depending on the malware complexity [4]. One reason that malware analysis is so time-consuming is that it is not easy to identify the region in the binary data that characterizes the functionality of the malware. ...

Automating Reverse Engineering with Machine Learning Techniques
  • Citing Article
  • November 2014

... Like SOIM, the privacy knowledge area skims a wide surface area but does include obfuscation-based inference control as a technical confidentiality protection approach. Obfuscation measures appear to be the best path toward this project's data confidentiality and utility goals, an approach taken by many previous projects sharing network trace data for third party assessment (Mivule & Anderson, 2015; Mohammady et al., 2018; Pang et al., 2006; Xu, J. et al., 2001) and more recent research protecting personally identifiable information (PII) in other types of cyber event logs (DeYoung, 2018; Menges et al., 2021; Rasic, 2020). (Fung et al., 2010), released a survey of obfuscation-based research focused on data subject privacy preservation while retaining a high degree of utility within published datasets in addition to expanding on many techniques discussed in CyBOK. ...

A Study of Usability-aware Network Trace Anonymization

... Working directly on this approach, in Kao et al. (2015) the authors proposed a Bayesian nonparametric approach to modelling the probability transition matrices $\{P_i\}_i$ by using a mixture of matrix Dirichlet distributions (MD) with a $DP(\theta, P_0)$ as mixing distribution, that is, a mixture of Dirichlet processes (MDP) (Antoniak, 1974). More specifically, the authors assumed a hierarchical model on the transition matrices, where $P_i \mid \sigma, Q_i \overset{\text{iid}}{\sim} MD(\sigma Q_i)$ with $\sigma > 0$ the concentration parameter and $Q_i$ the shape parameter. ...

Malware Detection Using Nonparametric Bayesian Clustering and Classification Techniques
  • Citing Article
  • November 2014

Technometrics

... Traditional learning-based approaches for malware analysis heavily rely on handcrafted feature extraction. Typically, these methods extracted the malware features manually and then utilized their handcrafted features to feed shallow classifiers like support vector machine (SVM), naive Bayes classifier, decision trees, k-nearest algorithms, etc. [37][38][39][40][41]. However, the effectiveness of these methods relies heavily on feature engineering. ...

Improving malware classification: Bridging the static/dynamic gap
  • Citing Conference Paper
  • October 2012

... On the other hand, dynamic methods are behavior-based approach that execute suspicious code in a controlled environment and monitors for any malicious behavior [6]- [10]. The dynamic method can detect new kinds of malware, malware variants, and obfuscated known malware. ...

Stochastic identification of Malware with dynamic traces

The Annals of Applied Statistics

... It is natural to extend existing single kernel clustering methods into multiple kernel scenario. The typical methods include K-means based [2,4,7,18,20,32,37,41], self-organizing map (SOM) [24], maximum margin clustering based [26,30,34], local learning-based [33], spectral clustering based [1,3,6,12,13,22,29] and subspace clustering based [8-10, 39, 40] algorithms. Compared with the single kernel counterpart, MKC should take special effort to handle the additional data problems such as noisy and incomplete kernels [2,15,19,21,28,36,38,40,41]. ...

Multiple Kernel Learning Clustering with an Application to Malware
  • Citing Conference Paper
  • December 2012

... For example, Gao et al. [27] demonstrated the effectiveness of graph convolutional networks in detecting android malware based on their API usage graphs. Anderson et al. [2] presents the only study we know of that uses instruction-level resource graphs to classify malicious software behavior. They use the adjacency of instructions in a dynamic execution trace to construct a Markov chain of assembly instructions, which are classified using graph kernels and machine learning. ...

Graph-based malware detection using dynamic analysis

Journal in Computer Virology

... Feature subsets vary at each level of segmentation, leading to a multilevel approach and thus optimal segmentation. 90 In another study, machine learning models were used to find markers correlated with ultrasonography-detected erosive arthritis. Features including joint and laboratory assessments, such as SLE-related autoantibodies, were used as inputs for supervised machine-learning algorithms, including decision tree and logistic regression. ...

An Automated Method for Segmenting White Matter Lesions through Multi-Level Morphometric Feature Classification with Application to Lupus