I'm working on a case where we have extracted raw data surrounding search hits from Unallocated Space. We have to de-duplicate these files before sending them on to our e-Discovery vendor for hosting. This seemed like a good use for ssdeep, but I have a couple of questions about the output, and I'm hoping someone can clarify.
ssdeep was run against a test set with these parameters: -rdbct 70
As I understand it, this should give me A matches B, but not B matches A, and so on, all at more than a 70% match.
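For reference, roughly the same all-against-all comparison can be sketched with the python-ssdeep bindings (assuming they and the underlying fuzzy-hashing library are installed; the directory name is just a placeholder):

```python
import os
import ssdeep  # python-ssdeep bindings around the ssdeep/libfuzzy library

THRESHOLD = 70  # same cut-off as the -t 70 switch

def pairwise_matches(directory, threshold=THRESHOLD):
    """Fuzzy-hash every file in the directory and yield pairs scoring >= threshold."""
    hashes = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            hashes[name] = ssdeep.hash_from_file(path)

    names = list(hashes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = ssdeep.compare(hashes[a], hashes[b])
            if score >= threshold:
                # Each pair is reported only once (A vs B, never B vs A again).
                yield a, b, score

for a, b, score in pairwise_matches("testset"):
    print(a, b, score)
```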
Below is a sample of output
Test10.txt Test1.txt 100
Test101.txt Test1.txt 100
Test101.txt Test10.txt 100
Test103.txt Test1.txt 100
Test103.txt Test10.txt 100
Test103.txt Test101.txt 100
Test104.txt Test1.txt 100
Test104.txt Test10.txt 100
Test104.txt Test101.txt 100
Test104.txt Test103.txt 100
Test106.txt Test1.txt 100
Test106.txt Test10.txt 100
Test106.txt Test101.txt 100
Test106.txt Test103.txt 100
Test106.txt Test104.txt 100
Test107.txt Test1.txt 100
Test107.txt Test10.txt 100
Test107.txt Test101.txt 100
Test107.txt Test103.txt 100
Test107.txt Test104.txt 100
Test107.txt Test106.txt 100
Here are my questions:
1. What does it take to get a 100% match? Since this is Unallocated space, I expect overlap between search hits, but not that the files would be completely identical.
2. Why do I see that Test10 matches Test1, but not the inverse? I would expect to see that Test1 matches Test10 (since 1 comes before 10) and then *not* see that Test10 matches Test1.
I have read another post suggesting that fuzzy hashing may not work at all on small files; our test set is indeed made up of small files, so that may be an issue. I can increase the size to some extent, but the files will still be relatively small, being Unallocated space and all.
Thanks for any help,
LM
LM,
Just out of curiosity, have you looked at the content of these files?
Sure, the content is largely junk (from what I've seen). This is not unexpected, given that it's all from Unallocated space. We advised the client of this in advance and explained that, with such an unnatural extraction of data around search hits, there is going to be a lot of overlap and a lot of junk.
To my eye, the content does not appear to be a true 100% match, but since much of it is not human readable, I must confess I have not spent a lot of time manually dissecting and comparing.
For our test set we extracted 500 characters on either side of each search hit into a raw text file. Thus, the average file size is ~1 KB, which could be causing problems as well, I guess.
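For what it's worth, the extraction logic was along these lines (a simplified sketch, not our actual tooling; the file names, offsets and hit lengths are made up):

```python
CONTEXT = 500  # bytes kept on either side of each search hit

def carve_hit(unallocated, hit_offset, hit_length, out_path, context=CONTEXT):
    """Write the search hit plus surrounding context out to a raw text file."""
    start = max(0, hit_offset - context)
    end = min(len(unallocated), hit_offset + hit_length + context)
    with open(out_path, "wb") as out:
        out.write(unallocated[start:end])

# Hypothetical example: a 10-byte hit at offset 12345 in the extracted unallocated data.
with open("unallocated.bin", "rb") as f:
    data = f.read()
carve_hit(data, 12345, 10, "Test1.txt")
```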
LM
To summarize (for posterity) what I have found through repeated testing and an offline exchange with Jesse Kornblum (thank you again!)…
The match percentage is based on the hash comparison, not on the file content. So, as I understand/interpret it, for this "needle in a haystack" sort of use it's not an exact science; it should give a good idea of the "sameness" of the files. Thus, a 100% match doesn't mean that it's the same file, but that the hash signatures are virtually identical. File size comes into play here, and files should be at least 4 KB for any accuracy.
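One simple sanity check (again with the python-ssdeep bindings plus the standard hashlib module; the file names are just examples from the listing above) is to take a pair that ssdeep scores at 100 and see whether a conventional hash agrees:

```python
import hashlib
import ssdeep

def md5_of(path):
    """Conventional cryptographic hash for a byte-for-byte comparison."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

a, b = "Test10.txt", "Test1.txt"  # a pair ssdeep reported at 100

score = ssdeep.compare(ssdeep.hash_from_file(a), ssdeep.hash_from_file(b))
identical = md5_of(a) == md5_of(b)

print("fuzzy score:", score, "byte-for-byte identical:", identical)
# A fuzzy score of 100 with differing MD5s means the signatures are
# effectively the same even though the files themselves are not.
```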
The -rdbct 70 switches do provide the "needle/haystack" info needed in our scenario. ssdeep hashes and compares all files in the given directory; the order in which they are hashed/compared, and the corresponding output, are not under our control. So the second part of my question is answered: Test10 is shown "before" Test1 because ssdeep hashed/compared Test10 before Test1. In other words, the "order" of the files as they appear on the drive does not reflect the "order" in which they are processed.
Hope that makes sense and is a fair explanation.
LM
I'm confused by the last posting.
In my book, hashing means MD5, SHA-1 (or similar). If the hashes are the same, the files are the same. A single bit of difference in a file produces a very different hash. There is no such thing as a similar hash.
Two different files could, in theory, produce the same hash, but then the same lottery numbers could also win the big prize two weeks in a row. In the real world this just does not happen.
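To illustrate that last point with a quick snippet (standard library only):

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
flipped = bytes([original[0] ^ 0x01]) + original[1:]  # flip one bit in the first byte

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(flipped).hexdigest())
# The two digests have nothing recognisable in common, even though the
# inputs differ by a single bit.
```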
Have I misunderstood something, or are we talking about a different understanding of hashing?
Have I misunderstood something, or are we talking about a different understanding of hashing?
Your confusion is understandable and your interpretation of a hash is entirely correct.
The technique/program being referred to in this topic is 'fuzzy hashing' and its implementation, 'ssdeep'. It can be used to detect files that are similar or even only partly identical (for example, because one file is truncated) - this is a bit of a simplification.
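A quick illustration with the python-ssdeep bindings (the data here is made up, and the exact score will vary):

```python
import ssdeep

# Made-up sample data: a buffer of text and a truncated copy of it.
full = b"Search hit context extracted from unallocated space. " * 200
truncated = full[: len(full) * 3 // 4]

score = ssdeep.compare(ssdeep.hash(full), ssdeep.hash(truncated))
print(score)
# Typically a high score: most of the content is shared, so the fuzzy
# hashes overlap heavily, even though a cryptographic hash of the two
# buffers would be completely different.
```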
The website of the tool that you can use to do this (
Reading the DFRWS article it does not mention the word fuzzy. What it does do is break a file down into smaller sections and hash each of them. That way, a file with 50 out of 55 hash values the same would be a similar file.
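That section-by-section idea can be sketched very simply (a naive fixed-block version of the idea, not ssdeep's context-triggered variant; the block size and file names are arbitrary):

```python
import hashlib

BLOCK = 512  # arbitrary block size for this toy example

def block_hashes(path, block=BLOCK):
    """MD5 each fixed-size section of a file."""
    with open(path, "rb") as f:
        data = f.read()
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def block_overlap(path_a, path_b):
    """Share of path_a's section hashes that also appear somewhere in path_b."""
    a = block_hashes(path_a)
    b = set(block_hashes(path_b))
    return sum(1 for h in a if h in b) / max(len(a), 1)

# e.g. 50 of 55 matching section hashes would come out at roughly 0.91 here.
print(block_overlap("Test1.txt", "Test10.txt"))
```

The catch with fixed blocks is that a single inserted byte shifts every later boundary, which is why ssdeep triggers its block boundaries on the content itself instead.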
I think fuzzy hashing is like the term 'almost unique'. Something is unique or not. A hash is a match or not a match.
However, your post has prompted me to investigate this area further. Thanks!
Reading the DFRWS article it does not mention the word fuzzy.
I think fuzzy hashing is like the term 'almost unique'. Something is unique or not. A hash is a match or not a match.
You would have to challenge Jesse Kornblum for using this term - it's featured on the ssdeep page (and, IMHO, it is a bit more understandable than 'context triggered piecewise hashes' 😉 ).
Regardless of the semantics, the method of identifying whether a file is similar to another file, instead of looking only for identical matches, can be very useful in a lot of situations. As you state correctly in your first posting, a one-bit change in a file results in a completely different hash value, whereas from a user's perspective the two files will most likely be very similar and maybe even related - like a temporary file and the stored version of a text document.
Anyway, good that the links were useful!