Conference Paper

An Efficient Piecewise Hashing Method for Computer Forensics

SouthWest JiaoTong Univ., Chengdu
DOI: 10.1109/WKDD.2008.80
Conference: Knowledge Discovery and Data Mining, 2008. WKDD 2008. International Workshop on
Source: IEEE Xplore

ABSTRACT Hashing, a basic tool in computer forensics, is used to ensure data integrity and to identify known data objects efficiently. Unfortunately, a file that has been intentionally modified in some small way cannot be identified with this traditional technique. Context triggered piecewise hashing splits a file into pieces based on local context characteristics and produces a sequence of piece hashes as the hash signature. The hash signature can be used to identify similar files that differ by small modifications such as insertions, replacements, and deletions. The algorithm in the currently available scheme was designed for junk mail detection; it is inefficient and not well suited to file system investigation. In this paper, an improved algorithm based on the Store-Hash and Rehash idea is developed for the context triggered piecewise hashing technique. Experimental results show that the new scheme outperforms spamsum in both speed and similarity detection ability, which makes it valuable for forensic practice.
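The general idea of context triggered piecewise hashing described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's Store-Hash and Rehash algorithm, nor spamsum itself: the names `piecewise_signature` and `similarity`, the 7-byte window, the byte-sum rolling hash, the SHA-256 piece hash, and the `difflib` comparison are all assumptions chosen for brevity. The sketch only shows why a small edit disturbs a few signature characters rather than the whole hash.

```python
import difflib
import hashlib

WINDOW = 7  # assumed size of the local context, in bytes
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def piecewise_signature(data: bytes, block_size: int = 64) -> str:
    """Return one base64 character per context-triggered piece of data."""
    signature = []
    piece_start = 0
    window_sum = 0  # toy rolling hash: sum of the last WINDOW bytes
    for i, byte in enumerate(data):
        window_sum += byte
        if i >= WINDOW:
            window_sum -= data[i - WINDOW]
        # Trigger: the local context alone decides where a piece ends.
        if window_sum % block_size == block_size - 1:
            digest = hashlib.sha256(data[piece_start:i + 1]).digest()
            signature.append(B64[digest[0] % 64])
            piece_start = i + 1
    if piece_start < len(data):
        # Hash whatever remains after the last trigger point.
        digest = hashlib.sha256(data[piece_start:]).digest()
        signature.append(B64[digest[0] % 64])
    return "".join(signature)


def similarity(sig_a: str, sig_b: str) -> float:
    """Stand-in similarity score in [0, 1] between two signatures."""
    return difflib.SequenceMatcher(None, sig_a, sig_b).ratio()


if __name__ == "__main__":
    original = bytes((i * 37 + 11) % 256 for i in range(4096))
    modified = original[:2000] + b"!" + original[2000:]
    sig_a = piecewise_signature(original)
    sig_b = piecewise_signature(modified)
    print(sig_a)
    print(sig_b)
    print(similarity(sig_a, sig_b))
```

Because each trigger depends only on the last few bytes, pieces that end before an edit keep their hash characters, and piece boundaries resynchronize shortly after the edit. A cryptographic hash of the whole file, by contrast, changes completely after any modification.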

  • ABSTRACT: A hash function is a well-known method in computer science to map arbitrarily large data to bit strings of a fixed short length. This property is used in computer forensics to identify known files on the basis of their hash value. As of today, in a preliminary step hash values of files are generated and stored in a database; typically a cryptographic hash function like MD5 or SHA-1 is used. Later the investigator computes hash values of the files he finds on a storage medium and looks them up in his database. Due to the security properties of cryptographic hash functions, they cannot be used to identify similar files. Therefore Jesse Kornblum proposed a similarity preserving hash function to identify similar files. This paper discusses the efficiency of Kornblum's approach. We present some enhancements that increase the performance of his algorithm by 55% if applied to a real life scenario. Furthermore, we discuss some characteristics of a sample Windows XP system, which are relevant for the performance of Kornblum's approach.
  • ABSTRACT: Bytewise approximate matching is a relatively new area within digital forensics, but its importance is growing quickly as practitioners are looking for fast methods to screen and analyze the increasing amounts of data in forensic investigations. The essential idea is to complement the use of cryptographic hash functions to detect data objects with bytewise identical representation with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have been studied and tested for a long time, approximate matching ones are still in their early development stages and evaluation methodology is still evolving. Broadly, prior approaches have used either a human in the loop to manually evaluate the goodness of similarity matches on real world data, or controlled (pseudo-random) data to perform automated evaluation. This work's contribution is to introduce automated approximate matching evaluation on real data by relating approximate matching results to the longest common substring (LCS). Specifically, we introduce a computationally efficient LCS approximation and use it to obtain ground truth on the t5 set. Using the results, we evaluate three existing approximate matching schemes relative to LCS and analyze their performance.
    Digital Investigation 05/2014; · 0.99 Impact Factor
  • ABSTRACT: Finding similarities between byte sequences is a complex task and necessary in many areas of computer science, e.g., to identify malicious files or spam. Instead of comparing files against each other, one may first apply a similarity preserving compression function (hash function) and then compare the hashes. Although several approaches exist, there is no clear definition or specification of such algorithms, nor of the properties they need.
    Information Security for South Africa (ISSA), 2012; 01/2012