If the hashing method does not have frequent "collisions", we have, unless I am mistaken,
256^512 different possible sector contents (each of the 512 bytes can take 256 values) - far more than the number of distinct hash values, so collisions exist in theory but are astronomically unlikely in practice.
256 to the power of 512 is a pretty big number 😯 , so if you really happen to find a handful of sequential sectors matching - provided that they are not "common" sectors, like 00'ed or F6'ed ones or specific headers/footers, and the like - you have a very, very high probability of the method giving a correct output.
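To put rough numbers on it (assuming MD5, a 128-bit hash, as the per-sector hash; the figures are only order-of-magnitude illustrations), a quick Python sketch:

SECTOR_BYTES = 512
possible_sectors = 256 ** SECTOR_BYTES   # every one of the 512 bytes can take 256 values
md5_values = 2 ** 128                    # distinct digests a 128-bit hash can produce

print(f"possible sector contents : roughly 10^{len(str(possible_sectors)) - 1}")
print(f"distinct MD5 digests     : roughly 10^{len(str(md5_values)) - 1}")
# Far more possible sectors than digests, so collisions must exist in theory,
# but on a single drive (a few billion sectors at most) the chance of two
# different sectors sharing a digest by accident is negligible.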
But point is, what can you do with the output?
If I get it right, what you can establish once you find that a substantial number of sector hashes for a given file "coincide" with sectors on a hard disk under examination is the following (provided that the actual "whole" file filename.ext is not in the "normal" filesystem, in which case you do NOT need the hashing 😉 ):
At some time file filename.ext was most probably present on the hard disk, or at least part of it was.
We have NO way to know if it was EVER accessed, viewed or modified.
We have NO way to know whether it was saved intentionally by the user or saved by some automated application; that application may in turn have been started manually and voluntarily by the user, triggered unknowingly by any number of actions the user performed, or run entirely on its own, as a rootkit, other malware or a "hostile" web site would.
We have NO way to know WHEN the above happened.
As I see it, you do not need a big number of sectors to be reasonably certain that the file was there, but it seems to me like you can do very little with just this info.
jaclaz
Great discussion. I just completed the 508 SANS course with Rob Lee where we had a good discussion about Fuzzy Hashing and doing partial file matching using SSDEEP, which is synonymous with what we are talking about here.
My 2 cents are that, as examiners, we should use this technique in the same way we do keyword searches. It's a fantastic way to filter and find evidence, but on its own it means as much as a keyword hit (not very much). The context and interpretation of content and relevance through further examination is where all the value is.
This is going to be an awesome technique for comparing pictures, documents, source code, entire directories and eventually even video, as the music and film industries are heading in this direction fast.
… so if you really happen to find a handful of sequential sectors matching - provided that they are not "common" sectors, like 00'ed or F6'ed ones or specific headers/footers, and the like - you have a very, very high probability of the method giving a correct output.
Assuming sector contents are evenly distributed. But are they? Do we have any evidence of that?
I would guess that quite a number of sectors would appear more often – all zero, for one, and sectors corresponding to runtime code libraries for another. Or, in video files, perhaps, logo and trademark contents.
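One way to get actual evidence rather than a guess: scan an image and tally how often each sector hash repeats. A quick Python sketch (the image name is just a placeholder), assuming 512-byte sectors and MD5:

import hashlib
from collections import Counter

SECTOR = 512
counts = Counter()

# Tally how often each distinct sector content occurs on an image, to get a
# feel for how far real drives are from "evenly distributed".
with open("drive.dd", "rb") as img:          # "drive.dd" is a placeholder
    while True:
        sector = img.read(SECTOR)
        if len(sector) < SECTOR:
            break
        counts[hashlib.md5(sector).digest()] += 1

# The heavy hitters: all-zero sectors, fill patterns, common library code, etc.
for digest, n in counts.most_common(10):
    print(n, digest.hex())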
As I see it, you do not need a big number of sectors to be reasonably certain that the file was there, but it seems to me like you can do very little with just this info.
Not on its own, no. But in a case, say, covering three or four computers/hard drives, considerable overlap of sectors (on one computer, a full file, on the other unallocated sectors, say) may suggest content that had been present on all at some point. Whether that's useful or not depends on the case, but it is a tool for establishing correlation.
The expense of using this particular tool is pretty high – so it probably doesn't have a place in run-of-the-mill investigations.
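Something along these lines (hypothetical image names, MD5 over 512-byte sectors): build the set of sector hashes per drive, drop the "common" sectors, and intersect:

import hashlib

SECTOR = 512

def sector_hashes(image_path):
    """Return the set of MD5 digests of every full sector in a raw image."""
    hashes = set()
    with open(image_path, "rb") as img:
        while True:
            sector = img.read(SECTOR)
            if len(sector) < SECTOR:
                break
            hashes.add(hashlib.md5(sector).digest())
    return hashes

# Throw away sectors too common to mean anything (all 0x00, all 0xF6, ...).
common = {hashlib.md5(bytes([b]) * SECTOR).digest() for b in (0x00, 0xF6)}

# "suspect1.dd" and "suspect2.dd" are hypothetical image names.
overlap = (sector_hashes("suspect1.dd") & sector_hashes("suspect2.dd")) - common
print(f"{len(overlap)} sector hashes present on both drives")

For real image sizes you would keep the hashes in a database rather than in memory, but the principle is the same.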
Great discussion. I just completed the 508 SANS course with Rob Lee where we had a good discussion about Fuzzy Hashing and doing partial file matching using SSDEEP, which is synonymous with what we are talking about here.
If block-level hashing means "calculate the MD5/SHA-1 of every sector and use these to determine whether parts of a known file stream exist on a drive", then, no, this is not synonymous with ssdeep. ssdeep uses context-triggered piecewise hashing, which is a different technique; notably, the segments of data that are hashed are not fixed-length.
Jesse Kornblum has some great presentations on his blog that go over how CTPH works.
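To make the difference concrete, the fixed-block variant is just this (plain hashlib, nothing ssdeep-specific; the sample data is made up for the demonstration):

import hashlib, os

BLOCK = 512

def block_hashes(data):
    """Fixed-length piecewise hashing: one MD5 per 512-byte block."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

original = os.urandom(16 * 1024)      # 16 KiB of sample data
modified = b"\x00" + original         # insert a single byte at the front

surviving = set(block_hashes(original)) & set(block_hashes(modified))
print(len(surviving), "of", len(block_hashes(original)), "block hashes still match")
# Prints 0 of 32: one inserted byte shifts every fixed block boundary.
# CTPH lets the content itself choose the boundaries, so ssdeep still
# reports a high similarity for the same pair - the two techniques are
# not interchangeable.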
I think block-level hashing is "okay" (and merely that) for detecting that fragments of a known file exist on a hard drive. In terms of how to explain that, my own view is that it means only that the particular fragment exists in the evidence. You then need to look at the fragment. Trying to come up with any sort of statistical argument (e.g., "this proves with 99.9876% confidence the suspect is guilty and should be hanged") seems nonsensical unless you can come up with an expected frequency distribution for the contents of sectors, and some idea of the variance of an average hard drive's contents from that frequency distribution. Someone claiming to have these statistics would either have an ocean to boil or a bridge to sell you.
Jon
For those of you with access to the Guidance portal, the script is here:
https://
I suggest you go get it and try it out for yourself.
The technique involves (as Jon rightly guessed) hashing each block of a file. The block size is user-defined, but blocks in anything other than sector or cluster sizes don't make sense. The last block is treated as special (because it might contain file slack).
The list of hashes is then used to test blocks either in the whole case or just those entries you want to test (like unallocated clusters).
If a whole file is found in contiguous blocks then that's a pretty straightforward bookmark.
When only part of a file is found, I think I'm right in saying that a graphic showing the matching blocks is created (pretty neat from what I saw at F3).
It will find heavily fragmented files in unallocated space - no bother.
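For anyone without access to the portal, the underlying idea is simple enough to outline outside EnScript. This is not the actual script, just a rough Python sketch under my own assumptions (placeholder file names, MD5 over 512-byte blocks, and the trailing partial block simply skipped rather than treated specially as the script does):

import hashlib

BLOCK = 512   # sector-sized blocks; cluster-sized blocks work the same way

def file_block_hashes(path):
    """Hash a known file in fixed-size blocks, skipping the trailing partial block."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if len(block) < BLOCK:
                break
            hashes.append(hashlib.md5(block).digest())
    return hashes

def find_blocks(image_path, known_hashes):
    """Scan a raw image block by block and return offsets where known blocks appear."""
    wanted = set(known_hashes)
    hits = []
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            block = img.read(BLOCK)
            if len(block) < BLOCK:
                break
            if hashlib.md5(block).digest() in wanted:
                hits.append(offset)
            offset += BLOCK
    return hits

# Placeholder names for a known file and a suspect image.
known = file_block_hashes("known_file.jpg")
print(find_blocks("suspect.dd", known))

A run of consecutive hits covering every block of the known file is the straightforward-bookmark case; scattered hits are the partial-match case the graphic illustrates.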
If you doubt the usefulness of this script then don't bother with it. I will use it because I have had cases where I could have used this to speed up my enquiry by orders of magnitude.
I've had a case where a drive had been repartitioned at least 3 times. I was recovering data from the earliest partition and eventually reconstructed a privoxy log with dates and URLs. Some of the URLs indicated unlawful material, so I isolated all the indicative URLs ending in '.jpg'. I then used wget to fetch whatever images were still out there (some 2 years after the creation of the privoxy log). From a list of around 100 images I got about 25 that were still being hosted and were indeed unlawful. In this case I found fragments of 5 files and all of the file for the other 20 on the suspect machine (even though some of them were heavily fragmented). At the time I had to do it by hand. With this script I could have done the same thing in less than a day rather than the two or three weeks it actually took.
My vote is that you put this tool in your armoury (so long as you know what it is doing, of course).
Oh, by the way, I got a conviction at court on those recovered, fragmented files 🙂
Paul
One aspect of block hashing not mentioned is the location of the matching blocks. One knows where in the original file the matching hash is, so one can test if the matching hash on the suspect disk is in a valid location.
The method is often used to help search for files in slack and unallocated space. My thoughts are probably best described by the example below.
Let's consider an NTFS disk with the normal cluster size of 8 sectors. If a match is found for sector 0x36 of the original file (counting from 0), then it should be found at sector 0x6 of an NTFS cluster, because 0x36 mod 8 = 6. It could be that sectors 0-5 of the cluster have been overwritten, in which case one would expect sector 0x7 of the cluster (and sector 0x37 of the file) to match as well. If that is not the case, then the match should be treated as a false positive and ignored.
NB, the above logic is flawed if the drive has been repartitioned with different cluster start values.
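A rough sketch of that sanity check in Python (my own illustration, not part of the script; it assumes 8 sectors per cluster and that cluster 0 starts at the given volume offset, which is exactly the assumption the NB warns about):

SECTORS_PER_CLUSTER = 8

def alignment_ok(file_sector_index, disk_sector_lba, volume_start_lba=0):
    """Check that a matched sector sits at a plausible offset within an NTFS cluster.

    file_sector_index : which 512-byte sector of the known file matched (0-based)
    disk_sector_lba   : absolute sector on the suspect disk where it was found
    volume_start_lba  : first sector of the NTFS volume (cluster 0 starts here)
    """
    offset_in_cluster = (disk_sector_lba - volume_start_lba) % SECTORS_PER_CLUSTER
    return file_sector_index % SECTORS_PER_CLUSTER == offset_in_cluster

# The example from the post: sector 0x36 of the file (0x36 % 8 == 6) should
# land on sector 6 of some cluster; if it does not, treat the hit as a
# likely false positive.
print(alignment_ok(0x36, 0x1A2B06))   # True:  found at cluster offset 6
print(alignment_ok(0x36, 0x1A2B03))   # False: wrong offset within the cluster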