Identifying document similarity using a fast estimation of the
Levenshtein Distance based on compression and signatures
Peter Coates (a), Frank Breitinger (b)
(a) Kingswood Road, Weehawken, NJ, 07086, United States
(b) School of Criminal Justice, University of Lausanne, 1015 Lausanne, Switzerland
Email addresses: coatespt@gmail.com (P. Coates); frank.breitinger@unil.ch (F. Breitinger, corresponding author)
URL: https://hadoopoopadoop.com (P. Coates); https://FBreitinger.de (F. Breitinger)
ORCID(s): 0000-0001-5261-4600 (F. Breitinger)
Copyright remains with the authors.
ARTICLE INFO
Keywords:
Levenshtein Distance
Edit Distance
Estimation
Document Similarity
Approximate String Matching
Fingerprint
Digest
ABSTRACT
Identifying document similarity has many applications, e.g., source code analysis or plagiarism
detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein Distance is a common metric to define the similarity between two documents, but it has quadratic runtime, which makes it impractical for large documents, where 'large' starts at a few hundred kilobytes. In this paper, we present a novel concept that allows estimating the Levenshtein Distance: the algorithm first compresses documents to signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), where the outcome is the estimated Levenshtein Distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score allowing examiners to set a threshold and identify related documents.
1. Introduction
A crucial task for digital forensic investigations is the
identification of relevant artifacts (evidence) where examin-
ers often rely on automation to cope with the sheer amounts
of data. One example is file classification, which is often
done using hash functions: compute the hash value of a file
and compare the hash against a whitelist (in case of a match
the file can be ignored) or blacklist (in case of a match the file
is regarded as potential evidence that requires further investi-
gation). As hash functions can only identify exact duplicates,
they have been complemented by approximate matching
(AM), a.k.a. similarity hashing or fuzzy hashing (Breitinger, Guttman, McCarrin, Roussev, White et al., 2014a). Instead of providing a binary decision, most AM algorithms provide a certainty score indicating their confidence that two artifacts are similar or related. For instance, on a scale of 0 to 100, where 0 indicates no similarity, a score of 80 indicates that the tool is confident that the two artifacts are similar, where the definition of similarity is left to the tool and can refer to various levels, e.g., byte-level similarity vs. visual similarity in the case of images. While these algorithms
provide great value to practitioners, there are two problems: First, the interpretation of the output: a score of 80 does not mean that there is a similarity of 80 percent. In fact, two artifacts can have a score of 100 although their cryptographic hashes are different. Second, AM algorithms work best for larger artifacts but are less reliable for medium-sized texts, e.g., comparing spam emails with each other to identify spam campaigns. Such cases, depending on the length of the text, are better addressed by approximate string matching algorithms such as the Levenshtein distance.
Levenshtein distance (LD). LD is precise and has an
intuitive meaning regarding the dissimilarity of two strings.
Given a pair of strings, the LD is the number of single-
character edits (i.e., insertions, deletions, or substitutions)
that are required to turn one string into the other. For example, the LD of pat and mat is one (replace m with p); the LD of pats and mat is two (replace m with p and delete the s). The
LD of pairs of randomly chosen longer texts of equal length
tends to be narrowly distributed. Therefore, an observed
divergence of the LD of a pair of texts from the expected
mean can be interpreted mathematically and also has an
intuitive meaning to a human. Unfortunately, in practice, LD is primarily relevant for short strings (e.g., comparing two strings of 50'000 characters each already requires about 3.5 s on a modern laptop) because the time complexity of the algorithm is quadratic¹, being proportional to the product of the lengths of the two strings, i.e., O(|a| × |b|).
Quadratic performance does not present a hard size limit, but
comparing larger strings, such as Web pages, long articles, or
books, becomes impractically slow, and faster computers are
of little help when confronted with a quadratic performance
curve.
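To make the quadratic behavior concrete, the following is a minimal Go sketch of the classic Wagner-Fischer dynamic-programming computation of the LD (illustrative only, not the implementation used later in this paper): it fills a (|a|+1) x (|b|+1) matrix, which is why time and memory grow with the product of the string lengths.

package main

import "fmt"

// levenshtein returns the edit distance between a and b using the
// classic Wagner-Fischer dynamic-programming matrix. Runtime and
// memory are O(len(a) * len(b)), i.e., quadratic.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i // delete the first i characters of a
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j // insert the first j characters of b
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			d[i][j] = minInt(d[i-1][j]+1, // deletion
				minInt(d[i][j-1]+1, // insertion
					d[i-1][j-1]+cost)) // substitution or match
		}
	}
	return d[len(ra)][len(rb)]
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

func main() {
	fmt.Println(levenshtein("pat", "mat"))  // 1
	fmt.Println(levenshtein("pats", "mat")) // 2
}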
Contribution. The heuristic described in this article provides a partial solution to the problem by offering a way to estimate the LD of text pairs over a size range many hundreds of times larger than is practical for calculating an exact value.
The algorithm works by first compressing the input (the
result is called signature) and then by applying the Leven-
shtein Distance to the signatures rather than to the original texts. Note, as the heuristic described herein itself uses
LD, it suffers from the same quadratic limitation. However, signatures can typically be many hundreds of times shorter than the original texts and therefore the algorithm produces estimates faster. For example, a size reduction factor of C results in a speedup of approximately C². Viewed the other way around, if a given computation time is deemed acceptably fast for files of up to size X, the heuristic extends the tolerable size to C·X (where C is a parameter that can be adjusted by the user).
¹ It has been shown that LD cannot be calculated in better than quadratic time for the worst case, but it is possible to do better in special cases, for instance, for strings that are known to be similar. For simplicity, we ignore the special cases in this article.
Abbreviations. The following abbreviations are used:
LD Levenshtein Distance
eLD estimated Levenshtein Distance
LCA Lossy Compression Algorithm
For simplicity, we will use LD and eLD in the text as well as in the algorithmic description. Thus, LD(A, B) denotes the Levenshtein distance between two documents A and B. Note that we differentiate between 'signature' and 'digest': a digest is the return value of our compression algorithm (similar to a hash value), whereas the signature includes the digest as well as other information (e.g., filename and file size).
Structure. The remainder of the paper is organized as fol-
lows: first, we describe our algorithm in Sec. 2, which is the
core contribution of this article, followed by a brief summary
of the reference implementations in Sec. 3. Sec. 4 provides an assessment of our algorithm as well as a discussion (Sec. 4.5). In Sec. 5 we discuss some forensic applications from a high-level perspective and introduce the significance score, which allows filtering for related documents. The last two sections are the Related work and the Conclusions.
2. The Algorithm
We estimate (approximate) the LD of two documents
by (1) compressing each document into a signature using a
Lossy Compression Algorithm (LCA), (2) applying the LD
algorithm to the compressed signatures, and (3) scaling the
result back by the compression rate. As LD is well-defined,
this section discusses the compression algorithm in Sec. 2.1, the layout of the eLD signature in Sec. 2.2, and the scaling / estimation of the LD in Sec. 2.3.
2.1. Compression Algorithms
As outlined in the subsequent paragraph, conventional
compression algorithms such as LZ77 or BZIP2 are not
suited to this purpose (Sec. 2.1.1). Given these limitations,
we present properties necessary for our lossy compression
algorithm in Sec. 2.1.2 followed by the algorithmic descrip-
tion in Sec. 2.1.3.
2.1.1. Conventional Compression
One reason conventional compression algorithms are not
suitable is that they perform ‘lossless’ compression, which
greatly limits their compression rate. Lossless compression
is inherently limited by the amount of information present
in the data to be compressed. In information theory terms,
English text contains about 1.3 bits of information per 8-
bit character, which makes typical text about 0.1625 infor-
mation and 0.8375 ‘air’ that can in principle be squeezed
out without rendering the original irretrievable. This lim-
its text compression to a factor of approximately 6:1. A
second reason is that we want the compressed version to
be something like the thumbnail version of a photograph.
This means that despite being many times smaller (in size)
than the original, a signature must preserve some kind of
recognizable similarity to the original if direct comparison
of signatures is to provide a meaningful result. Ordinary
compression algorithms systematically eliminate this kind of
local similarity because maximum compression means that
the result looks uniformly random.
2.1.2. Properties for a lossy compression algorithm
When estimating the LD, we rarely if ever are concerned
with rates as low as the highest rates conventional com-
pression can achieve. A compression rate of 25:1 would
be quite small for an estimation application; compression
factors from 100 to the low thousands are more usual for text
and even larger compression factors may be used for binary
data such as videos.
For our approach, the compression function requires the
following properties (note: the term (lossy compression)
digest is used to describe the output of our compression
function):
Compression: The digest must be much smaller than
the original input where ‘much’ is a factor of 100 to
the low thousands.
Determinism: The digest must be identical for identi-
cal inputs.
Runtime efficiency: Ideally, the compression algo-
rithm runs in linear time with respect to the length of
the input. At least, it has to be fast.
Concatenation: The result of concatenating the digests
of two strings shall be identical to concatenating the
two strings first and then calculating the digest, i.e.,
𝑑𝑖𝑔(𝐴) + 𝑑𝑖𝑔(𝐵) = 𝑑𝑖𝑔(𝐴+𝐵).
Our proposed algorithm does not fully satisfy the fourth property and therefore we settle for "good enough". That is, if we concatenate strings first and then generate a digest (dig(A+B)), the difference from the concatenated digests (dig(A) + dig(B)) must be limited to differences arising from a bounded number of characters on either side of the point of concatenation of the originals. As we will elaborate on below, the number of characters around the point where strings abut that can affect the output is determined by one of the two key compression parameters.
2.1.3. Lossy Compression Algorithm (LCA)
Our LCA uses a rolling window of size N (neighborhood) that slides through the text (character by character) and generates a pseudo-random value (hash) at each position (if the document has L characters, L − N + 1 neighborhoods are processed). Depending on a neighborhood's hash, our LCA either (a) does nothing, which is by far the most common decision, or (b) uses the hashed value to select a single character from a fixed set, 'replacing' the neighborhood in the digest. This keyhole view of the data ensures that any substring of the input can have only a localized effect on the output. The exact procedure is as follows:
LCA description. Let S be the input of length L, and let S_p denote the current position p in S with 0 ≤ p ≤ L − N. H_N(S_p) denotes the hash function applied to the substring of length N starting at position p in S (details about H are discussed in Sec. 2.1.4). Let ALPHABET be an array of unique characters (e.g., [a...z, A...Z, 0...9]). Then our LCA works as follows (a visual description of the implementation is provided in Fig. 1, and a code sketch is given after the step list):
1. dig (lossy compression digest) denotes a string² containing the digest, which is initially empty. The compression rate C is an integer and is discussed later.
2. If p + N > L, print dig and quit (p is initially zero).
3. Compute H_N(S_p) and store the result in T.
4. If (T mod C) == 0³,
(a) generate tmp = T mod len(ALPHABET), and
(b) append ALPHABET[tmp] to dig.
5. Set p = p + 1 and return to step 2.
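The following Go sketch mirrors the step list above. It is a simplified illustration rather than the reference implementation: it operates on bytes and uses FNV-1a from the standard library purely as a stand-in for the neighborhood hash H, whose actual selection is discussed in Sec. 2.1.4.

package main

import (
	"fmt"
	"hash/fnv"
)

// compress produces a lossy compression digest of s following the step
// list above: hash every N-byte neighborhood and, whenever the hash is
// divisible by C, emit one character chosen from alphabet.
func compress(s string, n, c int, alphabet []byte) string {
	var dig []byte
	for p := 0; p+n <= len(s); p++ { // L-N+1 neighborhoods
		h := fnv.New64a() // placeholder for H_N
		h.Write([]byte(s[p : p+n]))
		t := h.Sum64()
		if t%uint64(c) == 0 {
			dig = append(dig, alphabet[t%uint64(len(alphabet))])
		}
	}
	return string(dig)
}

func main() {
	alphabet := []byte("abcdefghijklmnopqrstuvwxyz" +
		"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
	text := "HELLO_WORLD! This is a short demonstration input."
	// Small N and C so that the tiny example likely produces some
	// output; realistic values are N = 11..21 and C in the hundreds.
	fmt.Println(compress(text, 7, 5, alphabet))
}

With realistic parameters, the digest length is roughly 1/C of the input length.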
The three required parameters (C, N, ALPHABET) impact the algorithm as follows:
The nominal compression rate C: The larger C, the less likely the if-statement (step 4) is triggered and thus the smaller the digest will be. The choice of C is subject to the following limitations: if C is too large, too much information is lost and the digest provides little value (or may even be an empty string); if C is too small, digests will be bulky, requiring more storage, and comparing digests using LD will be slow. The optimal choice depends on the use case.
The neighborhood size N: Compression operates on size-N substrings of the input, one neighborhood starting with each successive character. The larger N, the more sensitive the heuristic will be to small differences because each character is part of N distinct neighborhoods (except within N characters of the beginning or end of the input). For instance, let us assume an input with L = 1000 and N = 500. Performing two changes, e.g., at positions p = {300, 700}, will likely change the resulting digest completely. On the other hand, if N is too small, the lack of diversity in the neighborhood substrings results in an uneven distribution of output characters. Common values for N range from 11 to 21.
The ALPHABET describes the characters used to build the digests, e.g., printable ASCII characters except newline, tab, or space, chosen to be conveniently readable for the user. The larger the ALPHABET, the less likely two different T's are mapped to the same character. Note, the length of ALPHABET has to be mutually prime to C (for simplicity one may use an ALPHABET of prime length). If they are not mutually prime, different strings are more likely to be mapped to the same output character. ALPHABET cannot include ',' (comma) as it is used as a separator in the final signature.
² A string was chosen for simplicity and readability; any other format such as binary, hex or base64 would be possible and would only require minor changes.
³ While we chose 0, any other arbitrarily chosen value between 0 and C − 1 will work.
Given the ALPHABET = [a...z, A...Z, 0...9], a digest generated by the LCA may look like: AeVVgCAe2aZUpa6dnEkK...
Figure 1: A simple visualization of our compression algorithm; at p = 3 the sequence LLO_WOR is compressed to the letter 'a' (hash values are randomly chosen and do not match the actual implementation).
2.1.4. Selection of Neighborhood Hash Function
As every N-sized neighborhood is hashed, resulting in L − N + 1 calls of the hash function H, its runtime efficiency is crucial. Naturally, a rolling hash function (e.g., Rabin-Karp) or a non-cryptographic hash function (e.g., djb2) seems to be a good choice as they are performant. To analyze the impact of H, several different hash functions were tested and the results are summarized in Sec. 4.1. Our final implementation uses the Rabin-Karp fingerprint, but the source code allows this function to be easily replaced. Note, Rabin-Karp is a non-cryptographic hash function and does not have cryptographic properties such as preimage resistance or collision resistance. However, the lossy compression ensures that the preimage cannot be recovered, although it is possible to generate an input that produces a given digest.
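To illustrate why a rolling hash is attractive here, the sketch below computes one polynomial hash per N-byte window, where each subsequent window hash is derived from the previous one in constant time, so all L − N + 1 neighborhoods cost O(L) overall. The base and modulus are illustrative choices and do not match the constants of the reference implementation.

package main

import "fmt"

// rollingHashes returns one polynomial hash per N-byte window of s.
// Each window hash is derived from the previous one in O(1):
//   h(p+1) = (h(p) - s[p]*base^(N-1)) * base + s[p+N]   (mod m)
func rollingHashes(s string, n int) []uint64 {
	const base = uint64(257)
	const m = uint64(1_000_000_007) // illustrative prime modulus
	if len(s) < n {
		return nil
	}
	// base^(N-1) mod m, needed to remove the outgoing byte.
	pow := uint64(1)
	for i := 0; i < n-1; i++ {
		pow = pow * base % m
	}
	var h uint64
	for i := 0; i < n; i++ { // hash of the first window
		h = (h*base + uint64(s[i])) % m
	}
	out := []uint64{h}
	for p := 1; p+n <= len(s); p++ {
		h = (h + m - uint64(s[p-1])*pow%m) % m // drop outgoing byte
		h = (h*base + uint64(s[p+n-1])) % m    // add incoming byte
		out = append(out, h)
	}
	return out
}

func main() {
	hs := rollingHashes("HELLO_WORLD!", 7)
	fmt.Println(len(hs), hs) // 6 window hashes for a 12-byte input
}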
2.2. eLD signature
The final eLD signature comprises a header as well as the LCA digest (in the following we will use the term digest instead of LCA digest). The header is necessary, as only a comparison of signatures with identical parameters is possible. Furthermore, additional information is required to perform the scaling as well as to assess the quality of a potential match. The final eLD signature includes the following values, separated by commas (* indicates mandatory values):
The path/filename of the original file.
The length of the original file*.
The C value used to generate the signature.
The N value used to generate the signature.
The length of the LCA digest.
The LCA digest*.
Values that are not essential can be suppressed in the output, e.g., if C and N are immutable they are not required in the signature. A sample output is given in Fig. 2.
Figure 2: Sample eLD signature output including all six values; the signature has been shortened for better readability; the comment line (#) was added manually for readability and is usually not printed.
$ go run *.go -in test-data/doc.txt
#filename, fileLength, C, N, digestLength, digest
test-data/doc.txt,13680,101,20,152,7n(n[(9&dU7;aRZaPgSGWzoFCC_r
{FUB;A]v?dGrL.bQZ!3GCT)V0r>XWpPNmQ6>#8YhTN]h-X598u+qw1Q1K:+&CVI
gIhCK//j2 ...
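The following sketch shows one way such a signature line could be represented and parsed, assuming the six comma-separated fields shown in Fig. 2; the Signature type and field names are illustrative and not taken from the reference implementation.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Signature holds the six fields of an eLD signature line as shown in
// Fig. 2: filename, fileLength, C, N, digestLength, digest.
type Signature struct {
	Filename     string
	FileLength   int
	C, N         int
	DigestLength int
	Digest       string
}

// parseSignature splits one CSV line into a Signature. Because the
// digest itself cannot contain a comma (see Sec. 2.1.3), a plain split
// on "," is sufficient.
func parseSignature(line string) (Signature, error) {
	f := strings.Split(strings.TrimSpace(line), ",")
	if len(f) != 6 {
		return Signature{}, fmt.Errorf("expected 6 fields, got %d", len(f))
	}
	var s Signature
	var err error
	s.Filename = f[0]
	if s.FileLength, err = strconv.Atoi(f[1]); err != nil {
		return Signature{}, err
	}
	if s.C, err = strconv.Atoi(f[2]); err != nil {
		return Signature{}, err
	}
	if s.N, err = strconv.Atoi(f[3]); err != nil {
		return Signature{}, err
	}
	if s.DigestLength, err = strconv.Atoi(f[4]); err != nil {
		return Signature{}, err
	}
	s.Digest = f[5]
	return s, nil
}

func main() {
	sig, err := parseSignature("docB,500,51,20,10,AABBCCDDEE")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", sig) // only signatures with identical C and N are comparable
}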
2.3. Estimating the LD
To estimate the LD of two documents given their eLD signatures, we first compute the LD between the digests and then perform a scaling where ideally LD(A, B) ≈ eLD(A, B). Before explaining the estimation, it is essential to realize that there are two scenarios:
1. The documents have roughly the same size and there-
fore the generated digests shall also have approxi-
mately the same size.
2. The documents have a large difference in size (e.g., ratio 1:5), which also results in a large difference in the digests' lengths.
In addition, we have to consider that two unrelated documents still have an LD shorter than their length. That is, if A and B are two texts with 1000 characters each, the maximum LD(A, B) would be 1000, but usually it is less as there is some overlap by chance. This phenomenon will be called the expected overlap R. As a consequence, estimating the LD requires more than only multiplying the result by the compression ratio.
Expected Overlap Ratio R. The expected overlap ratio R is mostly influenced by the underlying alphabet and the letter frequency. For instance, the alphabet in an English book covers almost all 100 printable ASCII characters (Python: $ len(string.printable)), but the distribution and order are not random, e.g., there will be spaces between words, the letter 'e' is the most frequently used, 'the' is a frequent word, etc. To obtain R, we programmed a python script⁵ that, given an alphabet, creates two test-strings (TS) of length 30'000 using randomly chosen alphabet elements and calculates the LD. The three alphabets are:
random_chars: This alphabet consists of 83 elements: string.ascii_letters, string.digits and ()[]+#_-!?%<>@.:;&/{}.
eng_chars: For this alphabet we parsed a book, removed newlines and generated one large string with 686'250 elements. As we randomly select one element at a time, this reflects the frequency of letters in English texts, i.e., the TS has many spaces and 'e' but only a few '!' or 'X'.
eng_words: Similar to before, we parsed a book but con-
verted it into a list containing words instead of one
large string. Thus, the alphabet is a list of 97’591
words. When creating TS, we added a space after each word.
Example TS for each alphabet are provided in Fig. 3. The expected overlap ratio R is then calculated as follows:

R_alphabet = 1 − LD(TS_1, TS_2) / len(TS_1)    (1)

where len(TS_1) = len(TS_2) = 30000. For each R_alphabet, we performed 10 runs and averaged the ratio, resulting in: R_random = 0.0417, R_chars = 0.1593 and R_words = 0.1902.
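The procedure for obtaining R can be sketched as follows; the paper's getR.py helper uses Python and test-strings of length 30'000, whereas the shorter length and the memory-saving two-row LD below are illustrative choices to keep the example quick.

package main

import (
	"fmt"
	"math/rand"
)

// levenshtein computes the edit distance with two rows only
// (O(len(b)) memory), which keeps the demo fast for longer strings.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			d := prev[j-1] + cost // substitution or match
			if prev[j]+1 < d {
				d = prev[j] + 1 // deletion
			}
			if cur[j-1]+1 < d {
				d = cur[j-1] + 1 // insertion
			}
			cur[j] = d
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// estimateR builds two random test-strings from the given alphabet and
// returns R = 1 - LD(TS1, TS2) / len(TS1), cf. Eq. (1).
func estimateR(alphabet string, length int) float64 {
	ts := func() string {
		b := make([]byte, length)
		for i := range b {
			b[i] = alphabet[rand.Intn(len(alphabet))]
		}
		return string(b)
	}
	ts1, ts2 := ts(), ts()
	return 1.0 - float64(levenshtein(ts1, ts2))/float64(length)
}

func main() {
	// The 83-element random_chars alphabet described above.
	alphabet := "abcdefghijklmnopqrstuvwxyz" +
		"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789()[]+#_-!?%<>@.:;&/{}"
	// 5'000 instead of 30'000 to keep the demo quick.
	fmt.Printf("R_random ~ %.4f\n", estimateR(alphabet, 5000))
}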
eLD scaling. Let A and B be two documents with len(A) ≥ len(B) (we expect the same for the digests, i.e., len(digA) ≥ len(digB)). Then the estimated LD is calculated as follows (for readability, | | replaces the len function):

digDiff = |digA| − |digB|
effectiveC = ( |A| + |B| ) / ( |digA| + |digB| )
digLD = LD(digA, digB)
scaledDigLD = ( digLD − digDiff ) * effectiveC / ( 1 + R_words )
fileLengthDiff = |A| − |B|
eLD = round( scaledDigLD + fileLengthDiff )    (2)

That means: eLD includes the difference in document length, fileLengthDiff, as for every missing character one LD operation is needed (the document lengths |A| and |B| are stored as part of the signatures). The second part of the estimation, scaledDigLD, which is added to fileLengthDiff, focuses on the portion of the digests that do not match, e.g., for digA = 12ABCD and digB = 12FE it focuses on the FE part (digLD − digDiff); the 12-part can be ignored as its LD = 0. The effective compression rate effectiveC is used for scaling. Lastly, we know that there is an expected overlap, which we factor in by dividing by (1 + R_words).
⁵ getR.py in the helper directory.
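A direct transcription of Eq. 2 might look as follows (a sketch under the stated assumptions len(A) ≥ len(B) and len(digA) ≥ len(digB); digLD would be obtained by running LD on the two digests):

package main

import (
	"fmt"
	"math"
)

// estimateLD implements the scaling of Eq. 2. lenA/lenB are the
// original file lengths, digA/digB the digest lengths, digLD the
// Levenshtein distance between the two digests and r the expected
// overlap ratio (R_words = 0.19 for English text).
func estimateLD(lenA, lenB, digA, digB, digLD int, r float64) int {
	digDiff := digA - digB
	effectiveC := float64(lenA+lenB) / float64(digA+digB)
	scaledDigLD := float64(digLD-digDiff) * effectiveC / (1 + r)
	fileLengthDiff := lenA - lenB
	return int(math.Round(scaledDigLD + float64(fileLengthDiff)))
}

func main() {
	// Values from the sample calculation below: docA has 700 bytes and
	// a 15-character digest, docB has 500 bytes and a 10-character
	// digest, and the LD between the digests is 10.
	fmt.Println(estimateLD(700, 500, 15, 10, 10, 0.19)) // 402
}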
Figure 3: Shortened examples of the test-strings used to calculate R based on the different alphabets.
random_chars: 9SRCv8A{CKO&h{Ly0u7Nu)gw0(OE(0e/!AMm#ss-IGu&CIAoB@lCbwF#Rn:/(Br5GC.6U??iMe.]]t{/VQcd#jW0sTpS?+Xi8GKO0uNVf(6qX50f[2TA&z##X1.
eng_chars: stwbeulahf yafyes tagokku seuca niuaasdnnicollneptdCannags t gahoni hiyneeet ntdtngn eio;awgptmpao RrD eu1lc;ccdsLnPtrs7r il
eng_words: are Penn still Logan? the John of compensation Nicola sins Wilcox,along booklet on 9, Louis for spur trees, Caleb of in volume
Sample calculation. Let us assume two documents (700 and 500 bytes) that have the first 260 bytes in common. They have the following signatures (C and N are identical, thus the signatures can be compared):
#filename,fileLength,C,N,digLength,digest
docA,700,51,20,15,AABBCFF00192192
docB,500,51,20,10,AABBCCDDEE
We will now follow Eq. 2 and calculate the eLD:
1. digDiff = 15 − 10 = 5 (to get the length of the digests we can either count or use the digLength column).
2. effectiveC = ( 700 + 500 ) / ( 15 + 10 ) = 48 (we had a nominal compression rate of C = 51, but the actual compression is likely different, so we compute the actual compression rate).
3. digLD = LD(AABBCFF00192192, AABBCCDDEE) = 10 (replace CDDEE with FF001 and add 92192).
4. scaledDigLD = ( 10 − 5 ) * 48 = 240, reduced to 240 / ( 1 + 0.19 ) = 201.68 (the five characters CDDEE correspond to half of docB's digest and should roughly cause an LD of 250; however, docA and docB likely have some overlap in this region, which is why we divide by the expected overlap factor).
5. fileLengthDiff = 700 − 500 = 200
6. eLD = round( 200 + 201.68 ) = 402
If we compare eLD to our starting situation (260 bytes in common), we know that the LD must satisfy 200 ≤ LD(docA, docB) ≤ 440, where 200 itself is impossible as this would mean the documents have the first 500 bytes in common, which is a contradiction.
3. Reference Implementation(s)
There are three reference implementations of the algorithm available - Java 11, Python 3.9 and GoLang 1.17 (thus all are platform independent). As they were developed and modified in parallel by different developers, they all work slightly differently but were used to validate results. In the following we will focus on the GoLang implementation; details about the Java implementation can be found in Appendix A; the Python implementation turned out to be significantly slower and will not be discussed further. All implementations are open-source and available online⁶.
⁶ The GoLang version can be downloaded here: https://www.fbreitinger.de/?page_id=218; the Java implementation is available here: https://coatespt.github.io/Fast-Levenshtein-Distance/; for Python, please contact the authors as it is not maintained.
The zip file contains our Go implementation in the eLD folder, some test data, as well as an overview of the test results. Options can be found in eLD/config/eLDconfig.go. To generate signatures for the files-folder run
$ go run *.go -in /path/to/dir > signatures.csv
and to compare all signatures within a signature file run
$ go run *.go -source signatures.csv
The help menu laying out the other options can be accessed using the -h option.
Remark: As it is an early-stage prototype, error handling and logging are currently limited. It may also be possible to optimize the code and make the implementations more efficient.
4. Evaluation
To assess our implementation, we ran experiments look-
ing into Signature generation runtime efficiency (Sec. 4.1),
Signature comparison runtime efficiency (Sec. 4.2), Esti-
mated LD vs. LD (unrelated/loosely related files) (Sec. 4.3),
and Estimated LD vs. LD (related files) (Sec. 4.4). The
last section is a Discussion of the evaluation (Sec. 4.5). All
experiments were conducted using the GoLang reference
implementation summarized in the previous section.
Device. Test results were obtained on a MacBook Pro
running BigSur version 11.06. The system used for testing
has an 8-Core Intel Core i9 @ 2.3 GHz with 64 GB Memory
and an Apple SSD AP1024N 1TB drive. During all tests, the
system was in use meaning that other processes like Browser,
Email client, etc. were running.
Dataset. As our implementation is geared towards text files, we utilized the Gutenberg project and downloaded 18,216 files ranging from 10 KB to 12 MB, totaling approximately 7.2 GB. Subsets were created as needed and are explained in the corresponding section.
4.1. Signature generation runtime efficiency
The efficiency of generating signatures is dominated by
the speed of the hashing algorithm as it has to process all
neighborhoods. To test the performance of different hashing algorithms, we created four versions of our tool utilizing Rabin-Karp, djb2, FNV and MD5. For each setting (N = {7, 14, 21}), we did three runs and averaged the times. The time command was used for measurement and we provide the user+sys time, which denotes the actual CPU time of the process⁷:
$ time go run *.go -in ../allfiles/ > res.db
⁷ $ go run will compile and run the program; compiling first using $ go build and then running the application would be slightly faster.
Proceedings of the Digital Forensics Research Conference Europe (DFRWS EU), March 29April 1, 2022
Page 5 of 11
P. Coates & F. Breitinger / Identifying document similarity using fast Levenshtein Distance estimation
Table 1
Average runtime for different hashing algorithms using three different N and a constant C = 301.

N            7        14       21
Rabin-Karp   42s      43s      42s
djb2         1m 55s   2m 35s   3m 18s
FNV*         2m 13s   2m 39s   3m 21s
MD5*         19m 18s  19m 41s  19m 39s

* Tests for FNV and MD5 utilized GoLang libraries and required type-casts which may have impacted runtime.
Table 2
Efficiency for comparing the sample signatures for various C (N = 11); csv size is the file size of the signature file used for comparison.

C      duration  csv size
51     2m 12s    342 KB
101    34s       175 KB
301    4.2s      59 KB
501    2.0s      37 KB
1001   0.9s      19 KB
The results are shown in Table 1; as a benchmark, we also processed the dataset with shasum, which required 14s. As expected, N has no impact on the Rabin-Karp fingerprint but does on the other algorithms, which process whole sequences and not char-by-char. It also has no impact on MD5 as long as N is smaller than the MD5 block size of 512 bit. A constant C was utilized since it has little/no impact on the runtime: C only influences the digest length (we confirmed our assumption by running some tests using different C). Lastly, we also ran some tests with the Java as well as the Python implementation, but both were significantly slower and will not be discussed here, e.g., Java using string.hashCode() usually required around 5 minutes.
4.2. Signature comparison runtime efficiency
While C has no impact on the signature generation, it influences the signature comparisons. Remember, the larger C, the smaller the final digest, and the digest comparison uses LD with O(L²). For this test, we randomly selected 40 files (between 22 KB and 1.2 MB) from our dataset with a total of 17.4 MB and created a signature file. We then measured the time for an all-against-all comparison (i.e., 780 comparisons in total) with various C's. Times and signature file sizes are summarized in Table 2.
4.3. Estimated LD vs. LD (unrelated/loosely
related files)
This section compares our estimated LD vs. the orig-
inal LD with respect to efficiency and accuracy for unre-
lated/loosely related files (as all documents originate from
the Gutenberg project, they share common header and footer
information which we did not remove for this test). We cre-
ated another subset using 20 random files from the original
dataset where each document was required to be between
20 KB and 40 KB ensuring that LD completes in a reason-
able time. In total, the 20 files had 627 KB. An overview
of the experiment results is given in Table 3, where the LD column shows the original LD followed by various C's. Our test ignored self-comparisons, i.e., docA is only compared against other documents and not against itself, resulting in a total of 190 comparisons.
To determine the avg error (row 2), we summed the absolute values of our estimated LD minus the LD and divided the sum by 190. In addition, we provide the min error, max error as well as the standard deviation. This raw error does not consider the ratio of the LD to the file length, which is necessary to interpret the meaning of the results. For instance, let us assume pairs of documents (size 32'000) having LDs of 1, 1000 and 15'000 and corresponding eLDs of 2, 2000 and 30'000. In all cases the relative error ((eLD − LD) / eLD) is 50% but the impact is different: LD 1 / eLD 2 both indicate a strong correlation between the files, while LD 15'000 indicates correlation but eLD 30'000 indicates unrelated documents. Therefore, we also put the error rates in perspective (we use the error rate as defined by MacKenzie and Soukoreff (2002, Eq. 1) but we then subtract the rates to obtain a single value):

ER = abs( LD(A, B) / max(|A|, |B|) − eLD(A, B) / max(|A|, |B|) )    (3)

Consequently, if LD and eLD have similar error rates (which is desired), ER is low. For our example, this would result in the following error rates: (abs(1/32000 − 2/32000) =) 0.00, 0.03 and 0.47, respectively.
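For completeness, Eq. 3 in code form; the function name is illustrative and the example values are the three pairs discussed above.

package main

import (
	"fmt"
	"math"
)

// errorRate implements Eq. 3: the absolute difference between the
// error rates of the exact LD and the estimated LD, both normalized
// by the length of the longer file.
func errorRate(ld, eld, lenA, lenB int) float64 {
	m := float64(lenA)
	if lenB > lenA {
		m = float64(lenB)
	}
	return math.Abs(float64(ld)/m - float64(eld)/m)
}

func main() {
	// The three example pairs from the text (file size 32'000):
	fmt.Printf("%.2f\n", errorRate(1, 2, 32000, 32000))         // 0.00
	fmt.Printf("%.2f\n", errorRate(1000, 2000, 32000, 32000))   // 0.03
	fmt.Printf("%.2f\n", errorRate(15000, 30000, 32000, 32000)) // 0.47
}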
As shown by our results, our algorithm is significantly faster and the results have a low ER. The larger C, the less accurate the algorithm, which may also be due to the rather small sample files (e.g., the average digest length is 158). Detailed results are also included in the downloadable zip file.
4.4. Estimated LD vs. LD (related files)
For this second test, we reused the 20-random-file dataset from the previous section, created a modified copy of each file and then compared the original version against the modified version. A summary of the outcome is provided in Table 4.
The proposed estimation works well for deletions as long as they are not spread out across the complete document (compare L1-12 vs. L13). Similarly, random insertions (L14+15) have a comparable impact on the performance, although if there are only a few insertions and a higher compression rate, this may be equalized. Swapping (L16+17) is also handled well, where it is eye-catching that the estimated distances are lower than the original distance. Clearly, our algorithm has difficulties if there are many minor changes throughout the document (L18-20). However, these may be reduced by some additional pre-processing such as trimming / stripping or making the comparison case insensitive.
Table 3
The LD column shows the time for a traditional LD implementation from GoLangPrograms.com; the remaining columns outline eLD results for various compression rates C, where N = 11 and EXPECTED_OVERLAP_TEXT = 0.19 remained unchanged. Columns provide the absolute value and the percentage value (rounded to one digit).

                          LD         C=11          C=21          C=51          C=101         C=201
Duration                  9min 09s   3.5s          1.2s          0.6s          0.4s          0.3s
Avg Error (abs/%)         -          1223 / 6.5%   1166 / 6.4%   1519 / 9.0%   1578 / 9.0%   1650 / 9.4%
Min Error (abs/%)         -          16 / 0.1%     10 / 0.1%     10 / 0.1%     3 / 0.0%      16 / 0.1%
Max Error (abs/%)         -          3498 / 23.2%  3484 / 23.1%  5177 / 34.3%  4187 / 40.7%  6942 / 35.6%
Std Dev. (abs/%)          -          822 / 3.6%    767 / 3.8%    1114 / 6.4%   1075 / 7.0%   1424 / 7.3%
Error rate (avg/std dev)  -          0.03 / 0.02   0.03 / 0.02   0.04 / 0.03   0.04 / 0.02   0.05 / 0.04
Table 4
Results of comparing one document against a modified version of itself, manipulated according to the Operation column (modifications were done manually and randomly). FS and LD represent the original file size and the Levenshtein distance. The remaining columns show the eLD for various C (again N = 11 and EXPECTED_OVERLAP_TEXT = 0.19 remained unchanged).

#   Files                          Operation                                      org. FS   LD      C=11    C=21    C=51    C=101
1   10348.txt / 10348mod.txt       Deleted 10 lines                               24226     663     718     798     709     663
2   ltplt10.txt / ltplt10mod.txt   Deleted 10 lines                               22869     634     672     652     717     719
3   7563.txt / 7563mod.txt         Deleted 51 lines in the beginning              23414     1024    1024    1024    1024    1024
4   7565.txt / 7565mod.txt         Deleted 101 lines in the middle                25983     2234    2234    2234    2234    2330
5   7906.txt / 7906mod.txt         Deleted 90 lines (some beginning; some end)    21586     3329    3329    3329    3329    3329
6   10630.txt / 10630mod.txt       Deleted 7 chunks                               32185     7085    7095    7103    7086    7086
7   17282.txt / 17282mod.txt       Deleted 3 large chunks                         25534     7436    7436    7436    7436    7436
8   18232.txt / 18232mod.txt       Deleted 15 times 3 lines                       31635     2964    3117    3022    3034    3165
9   8528.txt / 8528mod.txt         Deleted approx. first half                     33625     15811   15811   15811   15811   15811
10  11592.txt / 11592mod.txt       Deleted 10 paragraphs                          36253     4224    4262    4259    4224    4224
11  16637.txt / 16637mod.txt       Deleted 5 paragraphs at the beginning          39864     1731    1740    1731    1731    1731
12  wlett10.txt / wlett10mod.txt   Deleted big middle chunk                       32346     8968    8968    8968    8968    8968
13  lf17w10.txt / lf17w10mod.txt   Deleted all (450) `the'                        38902     1215    5409    5749    7128    7156
14  haw4610.txt / haw4610mod.txt   Inserted 10 random `A'                         31340     10      93      113     96      10
15  haw7810.txt / haw7810mod.txt   Inserted 40 random `A'                         36281     40      531     562     819     390
16  lf20w10.txt / lf20w10mod.txt   Swapped 6 paragraphs around                    37765     3218    2831    2969    2897    1559
17  14814.txt / 14814mod.txt       Swapped first and second half (approx.)        28050     17517   14128   14454   15134   17721
18  12337.txt / 12337mod.txt       All `b' (338) replaced with `B'                31241     338     4204    3172    3004    3248
19  17195.txt / 17195mod.txt       All `e' (3080) replaced with `E'               35158     3080    21629   21605   22510   20106
20  13322.txt / 13322mod.txt       All spaces (6572) replaced with double-space   38838     6572    34083   35883   31128   32218
4.5. Discussion of the evaluation
The initial assessment primarily focused on the C parameter and runtime; more testing is required to assess the influence of other parameters such as N or R (EXPECTED_OVERLAP_TEXT). However, this preliminary evaluation shows promising results, which can probably be optimized further by adjusting other parameters. Currently, our approach performs well as long as the files are of sufficient size to generate adequate signatures; for small files, one can fall back to the original LD.
Estimating signatures has both advantages and limita-
tions: Most obviously, if the application at hand requires a
precise LD value, the proposed heuristic is not useful. For
instance, the heuristic will often estimate an LD of zero for
texts that in fact vary slightly. It will never, however, do
the opposite, i.e., estimate that two documents have an LD
greater than zero when in fact they are identical.
Algorithmic performance depends upon the specific signature-generation parameters as well as the text itself, but estimates are typically within a few percent of the LD and tend to be more accurate for relatively similar documents, which is often the most useful category, as there are more efficient ways to identify very/completely different documents, such as approximate matching (see Sec. 6). The
level of accuracy is sufficient for tasks like detecting near-
duplication and partial duplication, estimating how much
two documents vary, filtering through large numbers of
documents to look for a near-match to some substantial block
of text, and detecting documents that are related in more
complex ways.
The effectiveness also depends upon variability in the input; there can be inputs that produce pathological results. For instance, the same short (relative to N) character sequence endlessly repeated may result in a 'zero length' output or in an excessive output because one or more of the N-length substrings that are repetitively encountered causes an output character to be generated. While repetitive input of any great length would be unusual in natural language text, it can show up in various kinds of machine-generated text such as CSV files, log files, experimental data, etc. From a forensic perspective, signatures for such data are unlikely to be useful (i.e., one would probably not compare them against other documents). If excessively short or empty, they would fail to match even nearly identical files, and if they were excessively long, they would result in excessive processing time. For realistic N and C, neither of these results will be close to the expected output size of 1/C of the input size and thus they can be easily identified.
The signature of a fragment of a document is identical
with the corresponding portion of the signature of a larger
document in which it is embedded (except possibly near
the points where they abut.) Therefore, signatures of related
documents differ in locations that are in approximately the
same relative positions as the corresponding differences in
the originals. This means, not only can signatures be used to
estimate how much two documents differ, they can also be
used to infer useful information about where and how they
differ. For example, they can indicate that two documents
differ mostly at the beginning and end but differ little else-
where, or that many differences are sprinkled throughout, or
that one document appears to be appended to the other.
One of the most useful aspects of the signature-based
approach is that neither the target documents nor the search
documents need be in hand at the time an estimate is made.
This has the obvious benefit of allowing the cost of signature
generation to be amortized over many uses, but it also means that it is possible to search a collection of signatures for similar documents and assess the degree of textual similarity of candidates without having direct access to any of the documents involved.
5. Digital forensic application
While many files on computers are binary, there is still plenty of text-based information that must be assessed during an investigation, which is the targeted application of our prototype implementation. Having an efficient implementation of the Levenshtein distance allows the algorithm not only to be applied to short sequences but also to larger files, such as complete documents or source code files, in order to cluster files, detect plagiarism or perform black-/whitelisting. In the following we outline four scenarios where our algorithm can contribute:
Source code analysis: During an investigation it may be
necessary to analyze and compare large software
projects in order to identify if there are similarities
in the source code. Identified similarities could mean
that source code was copied (stolen) and reused. On
the other hand, this also allows clustering similar files
together. An example here would be the analysis of
malicious code where an investigator received access
to the source code of thousands of malware samples
and wants to cluster similar samples together reducing
the time of the manual analysis (e.g., we assume that
similar malware applications have a similar behavior
and only one sample per family, i.e., cluster, needs to
be analyzed).
Fuzzy string matching: Instead of looking for exact matches,
it may be necessary to identify strings that roughly
match a given string or are a substring of a given
string. For instance, let us assume an examiner is given
a Spam-Email and has to proof that the suspect is
responsible for an Email spam campaign wherefore
s/he decides to search in all documents for the given
Email. It is likely that the Email has been modified
throughout the campaign to avoid detection by spam
filters.
Similarities between cybercrimes: Besides processing ev-
idence, our approach may also be useful to identify
similarities between cybercrimes as discussed by
Bollé and Casey (2018). In their work, the authors propose a “new approach to finding linkages and repetitions across cases in a cyber-investigation context using near similarity calculation of distinctive digital traces”. In their study, the authors use the Levenshtein distance on email addresses. Using a more performant algorithm allows not only analyzing email addresses but also cross-comparing additional (more comprehensive) traces.
Anonymous searches: As little useful information about
the contents of documents can be divined from the
signatures, one can have a service that allows an
outside party to check whether partial matches to a
document or other files are present without the service
having access to the library to be searched or to the
text to be searched for. The client does not have to
provide the query and the service only has access to
the signatures, not the originals.
Note, while the current prototype has been developed for
text files only, we plan to expand it and see if it can be used
for binary files as well.
5.1. Significance score
The estimated LD is computed for each pair of signatures, but in most cases, an estimate is not informative without context. To assess the importance of a comparison and filter out pairs that are not in a relationship (which is hard to do from the estimated LD score alone), we introduce the significance δ⁸. This value puts the eLD in relation to the digest lengths and allows an examiner to set a threshold T (0.0 ≤ T ≤ 1.0) so that only matches with δ ≥ T are shown. Note, T = 0.0 will return all matches and T = 1.0 will only return 'exact' matches with eLD = 0.
As discussed in Sec. 2.3, there are two different scenarios: digests may have a similar length or may vary widely in length. Ideally, δ considers both cases. That is, δ shall be high if a smaller digest is (almost) included in a larger sequence, and it shall also be high if there is a low LD between two similarly sized digests. Currently, the significance score δ for two digests is computed as follows (we assume len(digA) ≥ len(digB)):
⁸ This is work in progress and, as will be seen, the current score has some weaknesses requiring a workaround.
Table 5
Sample δ values for various digest lengths and LDs.

     len(digA)  len(digB)  LD(digA, digB)  delta
1    700        700        0               1.000
2    700        700        10              0.986
3    700        350        400             0.857
4    700        100        600             1.000
5    700        700        600             0.143
6    700        350        650             0.143
7    700        100        696             0.040
8    700        200        700             0.000
9    70'000     700        70'000          0.000
10   70'000     700        69'650          0.500
11   70'000     700        69'300          1.000
ld = LD(digA, digB)    (Levenshtein distance of the digests)
δ = ( len(digA) − ld ) / len(digB)
However, this calculation has the weakness that it does not work for very large differences in digest size (e.g., ratios of 1 to 20). Table 5 provides some sample delta values for given digest lengths and LDs, where the top third contains 'good' matches, the middle third contains 'poor' matches and the last third demonstrates where δ fails. The problem is, if the difference is too large, digA can be converted to digB just by deleting characters, resulting in LD(digA, digB) = len(digA) − len(digB).
There are several possible solutions: (i) define a multiplier so that both digest lengths have to be within this range of each other (e.g., factor 10); (ii) adjust the equation to consider this behavior (e.g., using R_random); or (iii) require that sequences share a common substring of a certain length (e.g., if digB is a substring of digA, they are likely related). A sketch of option (i) is given below.
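As an illustration, the following sketch computes δ with the factor-10 guard of option (i); the guard value and the function shape are illustrative workarounds, not a fixed part of the proposed algorithm.

package main

import "fmt"

// significance computes delta = (len(digA) - ld) / len(digB) with
// len(digA) >= len(digB). As a workaround for the weakness discussed
// above, it returns 0 when the digest lengths differ by more than a
// factor of 10 (option (i)).
func significance(lenDigA, lenDigB, ld int) float64 {
	if lenDigB > lenDigA { // enforce the ordering assumption
		lenDigA, lenDigB = lenDigB, lenDigA
	}
	if lenDigB == 0 || lenDigA > 10*lenDigB {
		return 0.0
	}
	return float64(lenDigA-ld) / float64(lenDigB)
}

func main() {
	fmt.Printf("%.3f\n", significance(700, 700, 10))      // 0.986: good match
	fmt.Printf("%.3f\n", significance(700, 100, 696))     // 0.040: poor match
	fmt.Printf("%.3f\n", significance(70000, 700, 69300)) // 0.000: guard applies
}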
5.2. Assessment of the significance score (δ)
To assess δ's raison d'être, we performed two test runs: (1) an all-against-all comparison of 72 unrelated files and (2) finding 10 originals in the full dataset.
Test 1. To create the 72-unrelated-files dataset, we used the large set, randomly selected 72 files and shrank them to 30 KB by removing text from the beginning and the end (i.e., deleting potential header/footer information). We then calculated the LD between all documents and sorted the 2556 pairs according to their LD, ranging from 22'255 to 25'762. Lastly, we selected some pairs with a low distance and analyzed them manually to ensure they are different. While further testing is required, we believe these still significant overlaps originate from the many leading spaces in the documents (see example in Fig. 4)⁹.
Next, we ran our algorithm (C = 51, N = 11) and computed the significance score δ for each pair.
⁹ For one pair, we removed the double spaces, which increased the LD by about 2200.
Figure 4: Extract from document 13649.txt showing the leading spaces which cause relatively low LDs.
Of the Hills of the Chankly Bore,--
Then, through the vast and gloomy dark
There moves what seems a fiery spark,--
A lonely spark with silvery rays
Piercing the coal-black night,--
A Meteor strange and bright:
Hither and thither the vision strays,
A single lurid light.
Table 6
Summary of the δ values for unrelated files. The right half of the table shows the accumulated number of pairs with a δ value below the given threshold; almost all matches were lower than 0.12.

Min     Max     Avg      0.03   0.06   0.09   0.12   0.15
0.025   0.122   0.058    19     1470   2479   2555   2556
Table 7
Distribution of the δ score from comparing 10 documents against the dataset (18'196 files).

δ ≤ 0.1        0.1 < δ ≤ 0.2   0.2 < δ ≤ 0.3   0.3 < δ ≤ 0.4   0.4 < δ ≤ 0.5
93'359         9357            18'193          25'743          24'298

0.5 < δ ≤ 0.6  0.6 < δ ≤ 0.7   0.7 < δ ≤ 0.8   0.8 < δ ≤ 0.9   0.9 < δ
10'998         2               0               0               10
The summarized results are shown in Table 6. As expected, we obtain low scores for all pairs, where only one pair is above 0.12. To conclude, δ works for dissimilar documents that are of the same/similar size.
Test 2. Our second test focused on identifying 10 of these unrelated documents in the whole dataset. Therefore, we created two signature files and used the source-destination mode of our application. As the dataset is rather large, we set C = 101 (N remained 11). Additionally, we manipulated the application to only consider files that are up to 10 times larger than the source file (if larger, δ is set to zero). The distribution of the δ score is shown in Table 7, where the top rows indicate the δ score and the bottom rows the absolute number of pairs.
As indicated by the table, the 10 matches were clearly identified with a gap to any other matches. We manually investigated the two matches with 0.6 < δ ≤ 0.7 and did not see any similarity. However, both files were close to 300 KB, which is the maximum difference in size to be considered and again highlights the weakness of the current δ formula.
6. Related work
Given the sheer amount of work that has been done on string similarity and approximation, we only include a summary, where we decided to split the section into two domains: string similarity / approximate string matching, and approximate matching (as defined by the digital forensics community).
String similarity. String similarity metrics have a long history and many different techniques exist. Wu (2020) proposes to divide them into four categories: (1) character-based / edit distance (examples are Hamming Distance, Levenshtein Distance, Damerau-Levenshtein Distance or Jaro Distance); (2) sequence-based (examples are Longest Common Subsequence, Longest Common Substring, or Gestalt Pattern Matching); (3) token-based (examples are Q-Grams, Overlap Coefficient or Jaccard Similarity); and (4) phonetic-sensitive (examples are Smith-Waterman or Editex). Our work falls into the first category, which is also known as the string-to-string correction problem (Wagner and Fischer, 1974). Many of these techniques have been analyzed in detail, optimized algorithms have been published and improvements have been presented. A thorough analysis of work related to the Levenshtein Distance has been done by Navarro (2001), who compared the performance of different algorithms in a number of experiments. However, most works focus on improving and not on estimating. For instance, Ukkonen (1985) proposes an improvement that runs in O(s · min(len(A), len(B))), where s is the edit distance of the two strings compared. Thus, the algorithm runs in approximately linear time for highly similar strings.
Approximate matching (AM). From a digital forensics perspective, these algorithms are interesting but, despite their improvements, too slow for practical usage on large amounts of data. Therefore, the community invented its own AM (a.k.a. fuzzy hashing or similarity hashing), which later has been summarized by the National Institute of Standards and Technology in SP 800-168 with the goal "to provide a definition and terminology to describe approximate matching in order to promote discussion, research, tool development and tool acquisition" (Breitinger et al., 2014a). The research in this domain revolves around developing novel algorithms (e.g., Kornblum (2006); Oliver, Cheng and Chen (2013); Chang, Ghosh, Sanadhya, Singh and White (2019)), comparing algorithms (e.g., Roussev (2011)) or improving/assessing existing implementations (e.g., Baier and Breitinger (2011)). Discussing all existing works is beyond the scope of this article, but a somewhat up-to-date summary is provided by Harichandran, Breitinger and Baggili (2016) as well as Martín-Pérez, Rodríguez and Breitinger (2021), who published a categorization of AM algorithms including a discussion of various algorithms. A point that is raised in many publications is the similarity score (i.e., the result when comparing two fingerprints), which is only an indicator of similarity; the similarity itself is not clearly defined, i.e., a similarity score of 50 does not mean that two inputs are 50 percent identical or that 50 paragraphs are identical. While this indicator (two inputs are similar with a high certainty) is sufficient for some use cases, other scenarios may require a more precise result. One example would be checking for plagiarism as often done by universities during online submissions.
Breitinger, Ziroff, Lange and Baier (2014b) published
an application named saHash (statistical analysis hashing)
which is based on four sub-hash functions, each creating
its own sub-hash value. The final, fixed-length hash is then
the concatenation of all sub-hashes. While the authors claim
that their implementation is almost as fast as SHA-1, it only
“enables the detection of ‘small’ changes between files up
to several hundred Levenshtein operations”.
7. Conclusions
The Levenshtein Distance (LD) is a well-known distance
metric to describe the dissimilarity between two strings
/ documents. While it works well for short sequences, it
requires a quadratic runtime and thus is impractical for long
sequences (e.g., hundreds of kilobytes). In this article we
presented a heuristic that allows estimating the LD also for
larger sequences. The algorithm works by first compressing a given document into a signature and then comparing signatures against each other to estimate the LD. Consequently, it is not necessary to possess the original documents to calculate their LD. To assess the quality of our algorithm, we performed several experiments and showed that our estimation is significantly faster and, depending on the modifications done to a source element, obtains accurate estimations (e.g., the experiments showed that estimations are very accurate for dissimilar documents of the same size and that 'deletions' are handled well).
Future Research. While the effectiveness of the heuristic
seems clear, there are several areas that need further inves-
tigation, especially the following four: (1) As outlined in
the article, the calculation of δ is not perfect and we had to introduce the requirement that documents have to be of similar size (e.g., within a factor of 10). More experiments are required to see which mechanism provides the best results from a digital forensic perspective. (2) Documents can have a large
LD despite sharing substantial blocks of identical or nearly
identical text. One way to detect pairs that are related in
this way is to divide large documents into uniform sized
blocks and estimate the LD’s of all pairwise combinations
of blocks from the two originals. This would allow relatively
small segments, that are similar, to stand out against a
background of unrelated text. (3) File signatures are like
thumbnail images, with the differences between a pair of
signatures providing a coarse map of the differences between
the corresponding originals. A better understanding how to
characterize and quantify different kinds of similarity, and
understand the limitations of this technique would make
this property more useful. (4) All experiments have been on
books in English that have been coarsely mutilated, usually
with many blocks of lines either being deleted or duplicated.
A more systematic survey of the quality of estimates based
on files with quantified degrees and kinds of differences
would be useful. The behavior on particular types of data
would also be useful. This would be of particular interest
in applying the algorithm to binary data such as video or
compiled computer programs.
References
Baier, H., Breitinger, F., 2011. Security aspects of piecewise hashing
in computer forensics, in: 2011 Sixth International Conference on IT
Security Incident Management and IT Forensics, IEEE. pp. 21–36.
Bollé, T., Casey, E., 2018. Using computed similarity of distinctive
digital traces to evaluate non-obvious links and repetitions in cyber-
investigations. Digital Investigation 24, S2–S9.
Breitinger, F., Guttman, B., McCarrin, M., Roussev, V., White, D., et al.,
2014a. Approximate matching: definition and terminology. NIST
Special Publication 800, 10–6028.
Breitinger, F., Ziroff, G., Lange, S., Baier, H., 2014b. Similarity hashing
based on levenshtein distances, in: IFIP International Conference on
Digital Forensics, Springer. pp. 133–147.
Chang, D., Ghosh, M., Sanadhya, S.K., Singh, M., White, D.R., 2019.
Fbhash: A new similarity hashing scheme for digital forensics. Digital
Investigation 29, S113–S123.
GoLangPrograms.com. Golang program for implementation of Levenshtein distance. https://www.golangprograms.com/golang-program-for-implementation-of-levenshtein-distance.html.
Harichandran, V.S., Breitinger, F., Baggili, I., 2016. Bytewise approximate
matching: the good, the bad, and the unknown. Journal of Digital
Forensics, Security and Law 11, 4.
Kornblum, J., 2006. Identifying almost identical files using context trig-
gered piecewise hashing. Digital Investigation 3, 91–97.
MacKenzie, I.S., Soukoreff, R.W., 2002. A character-level error analysis
technique for evaluating text entry methods, in: Proceedings of the
second Nordic conference on Human-computer interaction, pp. 243–
246.
Martín-Pérez, M., Rodríguez, R.J., Breitinger, F., 2021. Bringing order
to approximate matching: Classification and attacks on similarity digest
algorithms. Forensic Science International: Digital Investigation 36,
301120.
Navarro, G., 2001. A guided tour to approximate string matching. ACM
Computing Surveys (CSUR) 33, 31–88.
Oliver, J., Cheng, C., Chen, Y., 2013. TLSH – a locality sensitive hash, in:
2013 Fourth Cybercrime and Trustworthy Computing Workshop, IEEE.
pp. 7–13.
Roussev, V., 2011. An evaluation of forensic similarity hashes. Digital
Investigation 8, S34–S41.
Ukkonen, E., 1985. Algorithms for approximate string matching. Information
and Control 64, 100–118.
Wagner, R.A., Fischer, M.J., 1974. The string-to-string correction problem.
Journal of the ACM (JACM) 21, 168–173.
Wu, G., 2020. String Similarity Metrics - Edit Distance. https://www.baeldung.com/cs/string-similarity-edit-distance.
A. eLD Java Implementation
The implementation of our algorithm is written entirely
in Java 11 and compiled using Maven 3.6.3 and OpenJDK
11.0.11 (64-bit). The source code is generic Java and uses
only language features found in any standard imperative
programming language. To keep descriptions of performance
simple, the current version is single-threaded. It is
open source and can be downloaded here: https://coatespt.github.io/Fast-Levenshtein-Distance/. The application
supports command-line arguments or the use of a
configuration file. To generate signatures, run
$ java -jar Fast-LD.jar -p config_gen.properties
and to compare the files-folder against an existing set of
signatures (sigs.csv), run
$ java -jar Fast-LD.jar -p config_cmp.properties
In these examples, our Fast-LD implementation utilizes
a configuration file (CLI arguments override the configuration),
which is listed in Fig. 5. Parameters 𝐶, 𝑁, and 𝐶𝐻 (alphabet)
have been explained in Sec. 2.1.3; f points to a CSV file
whose content is a list of files to be processed (one per line);
ld=true indicates comparison mode (Levenshtein Distance
estimation); and ft is the signature database in CSV
format. To generate signatures only (no comparison), one
merely has to set ld=false (for better readability, ft can be
removed as it is not needed and will be ignored).
Figure 5: Content of the configuration file config_cmp.properties; the alphabet has been shortened for better readability.
c = 251
n = 11
f = input.csv
ch = abcdefghijklnmnoprstuv[...]234567890
ld = true
ft = sigs.csv
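For generation mode, the configuration differs only in the two points noted above (ld set to false, ft omitted). The following config_gen.properties is a sketch derived from that description, with values mirroring Fig. 5, and is not taken verbatim from the repository:

c = 251
n = 11
f = input.csv
ch = abcdefghijklnmnoprstuv[...]234567890
ld = false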