Thesis

Robust Static Analysis of Portable Executable Malware

Authors:
  • GDATA CyberDefense

Abstract and Figures

The PE format is complex, and robust PE parsing must consider the behaviour of all 32- and 64-bit Windows operating systems. It is unpredictable how many malformations are still unknown, which malformations will be possible with new Windows releases, and how they will affect analysis and antivirus software. Research in finding and documenting malformations must proceed as long as PE files are used. The present thesis contributes by raising awareness of possible consequences, describing solutions for robust parsing, providing a free and robust analysis library for public use, and turning anomalies from a disadvantage into an advantage for malware detection by using them as heuristic boosters or stoppers.
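As a rough illustration of what robust parsing with anomaly collection can look like, the following Python sketch reads a few PE header fields defensively and records malformations instead of failing on them. It is not the thesis library's API: the function name `parse_pe_header`, the result layout, and the anomaly messages are hypothetical, and the check against 96 sections only reflects the limit documented in the PE specification, which a given Windows loader may not enforce.

```python
import struct

# Hypothetical helper for illustration only; not the API of any real library.
def parse_pe_header(data: bytes) -> dict:
    """Read Machine and NumberOfSections, collecting anomalies along the way."""
    result = {"machine": None, "number_of_sections": None, "anomalies": []}

    # The MS-DOS header is 64 bytes long and starts with the magic "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        result["anomalies"].append("missing or truncated MS-DOS header")
        return result

    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)

    # We need the 4-byte signature plus the 20-byte COFF file header.
    if e_lfanew + 24 > len(data):
        result["anomalies"].append("e_lfanew points outside the file")
        return result

    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        result["anomalies"].append("invalid PE signature")
        return result

    # Machine and NumberOfSections are the first two COFF header fields.
    machine, sections = struct.unpack_from("<HH", data, e_lfanew + 4)
    result["machine"] = machine
    result["number_of_sections"] = sections

    # Unusual values are reported rather than rejected, so that a caller
    # can use them as detection heuristics.
    if sections == 0:
        result["anomalies"].append("file declares no sections")
    elif sections > 96:
        result["anomalies"].append("section count exceeds documented limit of 96")

    return result
```

A caller would pass the raw file contents, for example `parse_pe_header(open(path, "rb").read())`, and inspect the `anomalies` list alongside the parsed fields.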
... While libraries exist to do this, their development is nontrivial and time intensive. Among currently existing tools such as pefile and PortEX, not all tools will even agree on the values stored in the PE-header [18], highlighting the difficulty of the task. Furthermore, the Windows operating system does not always enforce its own specification [1,18], and that specification may be changed in the future, requiring additional work to update a system. ...
... Among currently existing tools such as pefile and PortEX, not all tools will even agree on the values stored in the PE-header [18], highlighting the difficulty of the task. Furthermore, the Windows operating system does not always enforce its own specification [1,18], and that specification may be changed in the future, requiring additional work to update a system. These problems are only compounded by the fact that malware may intentionally violate the PE standard or Windows operating system (OS) loading process. ...
... By using these headers in particular, and not fully replicating the PE-Miner approach, we can ensure that all methods we evaluate here have equivalent information available for learning from. We use the PortEX library [18] to extract 115 features, of which 112 are numerical (such as the pointer to the Import table) and 3 are categorical (such as the intended runtime architecture), from the header of an executable. The PortEX library is specifically designed to work with real malware headers, which do not always conform to the official specification [9]. ...
Conference Paper
Many efforts have been made to use various forms of domain knowledge in malware detection. Currently there exist two common approaches to malware detection without domain knowledge, namely byte n-grams and strings. In this work we explore the feasibility of applying neural networks to malware detection and feature learning. We do this by restricting ourselves to a minimal amount of domain knowledge in order to extract a portion of the Portable Executable (PE) header. By doing this we show that neural networks can learn from raw bytes without explicit feature construction, and perform even better than a domain knowledge approach that parses the PE header into explicit features.
... Since these features are manually crafted by security researchers, and since most of existing malware instances are obfuscated using different techniques, such as packing, encryption, etc., the extraction of such characteristics can be a tedious task. Moreover, the high dimensionality that characterizes the extracted features requires a feature selection (reduction) phase that aims at removing irrelevant features, and which can be a labor-intensive task as well. In the last decade, researchers have shifted to deep learning in order to overcome the aforementioned limitations of conventional machine learning approaches. ...
... The authors introduced two baseline approaches as well as a deep neural networks based one. The first baseline approach uses PE metadata extracted using a third-party library called PortEX [34]. They extracted 115 features, which were fed to two machine learning algorithms, namely, random forest [14] and extra random trees [29]. ...
Chapter
Malware, such as ransomware, Trojans, spyware, and botnets, is among the most common cyber-threats and can cause significant damage to organizations, governments, and individuals. Thus, malware analysis and detection are of great importance for security analysts in both industry and academia. Early signature-based and conventional machine learning-based solutions have shown their limits against the huge proliferation and sophistication of recent malware. To deal with this issue, cybersecurity researchers have shifted to deep learning in order to design more efficient malware detection solutions that can detect known and unknown malware, including sophisticated samples. In this paper, we provide a comprehensive review of state-of-the-art deep learning-based malware analysis and detection solutions targeting the Microsoft Windows desktop platform, over the period of 2015–2022. We provide a detailed taxonomy that classifies these solutions according to various criteria including the analysis task, the nature of the extracted features, the feature representation method used, and the deep learning algorithms used. Furthermore, we discuss these solutions with respect to the size and the nature of the testing dataset, the performance evaluation metrics for the different tasks, and the achieved results. Finally, we highlight the current research challenges and recommend some promising future research directions.
... While libraries exist to do this, their development is non-trivial and time intensive. Furthermore, the Windows operating system does not always enforce its own specification [20,33], and that specification may be changed in the future, requiring additional work to update a system. These problems are only compounded by the fact that malware may intentionally violate the PE standard or Windows operating system (OS) loading process. ...
... By using these headers in particular, and not fully replicating the PE-Miner approach, we can ensure that all methods we evaluate here have equivalent information available for learning from. We use the PortEX library [20] to extract 115 features, of which 112 are numerical (such as the pointer to the Import table) and 3 are categorical (such as the intended runtime architecture), from the header of an executable. The PortEX library is specifically designed to work with real malware headers, which do not always conform to the official specification [10]. ...
Article
Many efforts have been made to use various forms of domain knowledge in malware detection. Currently there exist two common approaches to malware detection without domain knowledge, namely byte n-grams and strings. In this work we explore the feasibility of applying neural networks to malware detection and feature learning. We do this by restricting ourselves to a minimal amount of domain knowledge in order to extract a portion of the Portable Executable (PE) header. By doing this we show that neural networks can learn from raw bytes without explicit feature construction, and perform even better than a domain knowledge approach that parses the PE header into explicit features.
... used extensively by antivirus scanners for many years. They are often used to detect malware which belongs to the same family with different signatures [61]. Table 3 shows the ClamAV byte signature [62]. ...
Article
Full-text available
According to recent studies, malicious software (malware) is increasing at an alarming rate, and some malware can hide in the system by using different obfuscation techniques. In order to protect computer systems and the Internet from malware, it needs to be detected before it affects a large number of systems. Recently, there have been several studies on malware detection approaches. However, the detection of malware still remains problematic. Signature-based and heuristic-based detection approaches are fast and efficient at detecting known malware, but the signature-based approach in particular fails to detect unknown malware. On the other hand, behavior-based, model checking-based, and cloud-based approaches perform well for unknown and complicated malware; and deep learning-based, mobile devices-based, and IoT-based approaches have also emerged to detect some portion of known and unknown malware. However, no approach can detect all malware in the wild. This shows that building an effective method to detect malware is a very challenging task, and there is a huge gap for new studies and methods. This paper presents a detailed review of malware detection approaches and recent detection methods which use these approaches. The goal of the paper is to give researchers a general idea of the malware detection approaches, the pros and cons of each detection approach, and the methods that are used in these approaches.
... This is the first and most simple technique used by specialists [16]. • Strings. ...
... Rust binaries should be linked into a final Portable Executable (PE). The PE file format is used in the Windows operating system and offers sectioning along with relocation [Hah14]. ...
Conference Paper
Full-text available
Rust, being a systems programming language, offers memory safety at zero cost and without any runtime penalty, unlike other languages such as C, C++ or Cyclone. Systems programming languages are mainly used for low-level tasks such as the design of operating system components, web browsers, and game engines, and for time-critical missions like signal processing. The main disadvantages of the existing systems languages are that they are memory unsafe and have a low-level design. On the other hand, Rust offers high-level language semantics and an advanced standard library with a modern skill set, including most of the features and functional elements of widely used programming languages. Moreover, Rust can be used as a scripting language like Python, as a functional language like Haskell, or as a low-level procedural language like C or C++, since Rust is both imperative and functional and has no garbage collector. These design choices make Rust a suitable match for low-level tasks by adding high-level scalability and maintainability. Meanwhile, the EFI (Extensible Firmware Interface) specification aims to remove the limitations of legacy hardware. Hence, we present our analysis of utilizing the Rust language in EFI-based bootloader design for the x86 architecture, to make it useful for both practitioners and technology developers.
Chapter
Nowadays, cybercriminals have become sophisticated and conduct advanced malware attacks on critical infrastructures in both the private and public sectors. Therefore, it is important to detect, respond to, and mitigate such threats in order to protect the cyber world. They leverage advanced malware techniques to bypass anti-virus software and remain stealthy while conducting malicious tasks. One of those techniques is called file-less malware, in which malware authors abuse legitimate Windows binaries to perform malicious tasks. Those binaries are called Living Off The Land Binaries (LOLBINS). As a result, no malicious executable is used during the execution of the attack and, consequently, the antivirus is unable to identify and prevent such threats. This paper focuses on defining rules to monitor the binaries used by threat actors in order to identify malicious behaviors.
Article
Full-text available
Portable executable (PE) file features play a key role in the detection of packed executables. Packing changes the internal structure of PE files in such a way that it becomes very difficult for any reverse engineering technique, Anti-Virus (AV) scanner, or similar program to figure out whether the executable is malware or benign. Therefore, it is very important to determine whether a given executable is packed or non-packed before classifying it as malicious or benign. Once a binary is detected as packed, it can be unpacked and given to an AV scanner or similar program. In this paper we include a brief description of the Portable Executable file format, as the internal structure of PE must be understood before identifying packed Portable Executables. We have considered executables packed by the UPX packer only, and hence describe the functioning of the UPX packer very briefly. Our approach works in two phases. In the first phase, it extracts various features of portable executables, and in the second phase it analyses the extracted features and comes up with the best set of features, which can be used to identify whether a given binary is packed by the UPX packer or not. Experimental results are shown at the end of this paper. We identify the key feature set, with proper justification, to show the differences between executables packed by the UPX packer and non-packed ones. Index Terms—Malware, non-packed, packed, portable executable.
Article
The course "Introduction to Computer Systems" at Carnegie Mellon University presents the underlying principles by which programs are executed on a computer. It provides broad coverage of processor operation, compilers, operating systems, and networking. Whereas most systems courses present material from the perspective of one who designs or implements part of the system, our course presents the view visible to application programmers. Students learn that, by understanding aspects of the underlying system, they can make their programs faster and more reliable. This approach provides immediate benefits for all computer science and engineering students and also prepares them for more advanced systems courses. We have taught our course for five semesters with enthusiastic responses by the students, the instructors, and the instructors of subsequent systems courses.
Article
We systematically describe two classes of evasion exploits against automated malware detectors. Chameleon attacks confuse the detectors' file-type inference heuristics, while werewolf attacks exploit discrepancies in format-specific file parsing between the detectors and actual operating systems and applications. These attacks do not rely on obfuscation, metamorphism, binary packing, or any other changes to malicious code. Because they enable even the simplest, easily detectable viruses to evade detection, we argue that file processing has become the weakest link of malware defense. Using a combination of manual analysis and black-box differential fuzzing, we discovered 45 new evasion exploits and tested them against 36 popular antivirus scanners, all of which proved vulnerable to various chameleon and werewolf attacks.
Article
Bell System Technical Journal, also pp. 623-656 (October)
Book
The 2nd edition adds material on the role of errors in scientific observation and a critical discussion of determinism from the standpoint of information theory to the material of the 1st edition, which applied information theory to a great number of problems of physics, including: the analysis of signals; thermodynamics; Brownian movement; thermal agitation in electronic tubes, rectifiers, etc.; entropy; Maxwell's demon; Szilard's well-informed heat engine; observations and error; communication; and computing. The new material on determinism leads to Brillouin's "matter of fact" point of view that strict determinism is impossible in scientific prediction because the high cost at some point makes increasing accuracy unattainable. The limit of accuracy is a practical rather than an inevitable limitation in the logical sense. The limitations can be formulated in precise ways by quantum conditions and information theory and should be included in the physical theory.