Two files can be the same size and have different hashes, indicating different content, and therefore the files are not duplicates.
Harlan - if you read my original post you will see I acknowledge the above.
My point is that the hashing function will detect a difference in the file irrespective of the file size. Therefore
If there are any dups based on size + hash
should simply be if there are any dups based on hash - checking the size is a wasted exercise, because two files with a different size cannot have the same hash (within the bounds of the hashing function, of course). If they are the same size but have different content/hash, then you do not need to check the size either.
Or are you saying that two files of a different size can have the same hash?
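To make the hash-only approach concrete, here is a minimal sketch in Python (the function names and the directory walk are illustrative, not from any tool mentioned in this thread): every file is hashed, and any hash seen more than once marks a set of duplicates.

```python
import hashlib
import os

def md5_of(path, chunk_size=65536):
    """MD5 of a file, read in chunks so large files are not loaded into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_dups_by_hash(root):
    """Group every file under root by hash alone; any group with
    more than one path is a set of duplicates."""
    groups = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups.setdefault(md5_of(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```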
He's saying two files can be different sizes and have the same hash. Generally this will only occur if you've intentionally constructed them that way by generating MD5 hash collisions. Since two files that have been made to have the same hash often (always?) have different file sizes, checking both hash and size will detect these. (In a forensic situation, you'd probably want a warning about this specifically – a hash collision is extremely unlikely to occur naturally.) At this point, you could also address it by using SHA1 hashes instead of MD5s, or by hashing with two algorithms simultaneously (since an MD5 hash collision does not mean the SHA1 hashes will collide, and vice versa).
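A sketch of the two-algorithm suggestion, assuming nothing beyond Python's standard hashlib: both digests are computed in a single pass over the file, so the second algorithm costs some CPU but no extra I/O.

```python
import hashlib

def dual_hash(path, chunk_size=65536):
    """Compute MD5 and SHA1 together in one read of the file.
    A crafted MD5 collision is vanishingly unlikely to also collide under SHA1."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```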
In that case he is overcomplicating the issue. For most deduplication purposes the hash should be enough UNLESS you suspect that someone has been specifically crafting files to have the same hash. Something that I believe would be rather misguided, as the person crafting the files has no idea what algorithm a forensic examiner might choose to run.
I wasn't aware that crafted files ALWAYS have a different size – do you have a link for this?
There IS an argument for checking file sizes, and this is something I implemented in a scanner I wrote to detect IIoC back in '98 for the UK police. As two files with a different size MUST have different hashes (the MD5 vulnerability aside), by checking the file sizes first you can avoid having to calculate a hash at all.
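A minimal sketch of that size-first idea (the function name is illustrative): files are bucketed by a cheap stat call first, and only files that share a size with at least one other file are ever read and hashed.

```python
import hashlib
import os

def find_dups_size_then_hash(paths, chunk_size=65536):
    """Bucket files by size first; a file with a unique size cannot
    have a duplicate, so it is never read at all."""
    by_size = {}
    for path in paths:
        by_size.setdefault(os.path.getsize(path), []).append(path)

    dups = {}
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # unique size: skip the (expensive) hash entirely
        by_hash = {}
        for path in group:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            by_hash.setdefault(h.hexdigest(), []).append(path)
        for digest, matches in by_hash.items():
            if len(matches) > 1:
                dups[(size, digest)] = matches
    return dups
```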
Generally
I have only heard of one instance of two different files genuinely having the same hash (there may of course be more). That was about 5-6 years ago if memory serves, and I think it was a Gendarmerie case. I think "generally" may not have been the word you were looking for )
I said on another post that there does seem a tendency in this industry to overcomplicate things. If all you are doing is looking for duplicate pictures or duplicate emails in an office environment (the OP wanted to get rid of dups), then it is safe to ignore crafted hash collisions and just rely on the hash – MD5 or SHA1 or ….
If you're using MD5 to check for duplicates in a forensic situation, you should consider intentional collision to be a possible – albeit very unlikely – case. Even less likely are accidental collisions. (As you indicate, very, very unlikely.)
Two files with different lengths are not strictly required to have different hashes. However, you are correct that two files with different lengths cannot possibly be the same file, so if you're eliminating duplicates, you only need to compute hashes on files of the same length.
I'm not sure if crafted collisions are required to have different lengths, but if I remember the paper correctly, you cannot engineer them to have the same length – if they have the same length, it is by luck.
If you're using MD5 to check for duplicates in a forensic situation, you should consider intentional collision to be a possible – albeit very unlikely – case.
Consider yes, use no.
If I had a case whereby I thought that
a) the person being investigated was clever enough
b) it was relevant (no point if deduplicating emails, restored mail from a hundred tapes etc.)
c) if it was a case where keyword searches etc. that I apply pre-deduplication would not identify the file
d) the person had an inkling that I might use MD5 over SHA1
Then I would consider it and immediately bin the idea in favour of SHA1
In god knows how many investigations I have carried out in the 16 years I have been working in forensics, I have never come across anyone/any case where this has been a real possibility (admittedly the problem hasn't been around that long )
Let's get real here and stop bigging up the problem: what is the likelihood of some bad guy hiding the one or two files that hang him using this technique? I have never heard of a single case where someone has utilised MD5 collisions to hide data.
I keep coming back to this: "there does seem a tendency in this industry to overcomplicate things". There was a tendency by a few in the early days of forensics to pick up on a particular issue and blow it all out of proportion.
I remember talking to one such person about inter-sector gaps on floppy disks (a necessary run of a particular pattern between the end of one sector and the address mark for the next, to allow the electronics to sync with the underlying clock and data). A few weeks later he was teaching about hiding data in the inter-sector gap in a forensics training course.
so if you're eliminating duplicates, you only need to compute hashes on files of the same length.
Any fast deduplication tool WILL use the file length: if a single large file is the only one of a given length, then there is no point in checking the hash – which can be a lengthy process. My argument earlier was that this was not necessary, not that it might not help.
*IF* I thought this was a real problem on a case then I would use SHA1 (actually I use SHA1 in all the tools I currently write).
I certainly wouldn't invest much time into worrying about collisions. They're a real but remote possibility, and as you point out, either checking file sizes first or using SHA1 covers this. (SHA1 collisions, or MD5 collisions with the same file size, simply aren't going to be seen in practice at this point.)
The MD5 hash algorithm is a widely accepted method for "fingerprinting" a file. To my knowledge, MD5 has never been successfully challenged in court. There are many free apps that will hash documents. I think we are splitting hairs with the MD5 collision talk. I have personally never heard of anyone having an MD5 collision outside of the couple of people who have reverse engineered the algorithm to create two files that produce the same hash.
… should simply be if there are any dups based on hash - checking the size is a wasted exercise, because two files with a different size cannot have the same hash (within the bounds of the hashing function, of course).
Have you calculated the waste? If you do, I'm sure you would turn that statement upside down:
comparing file hashes is a wasted exercise unless file sizes are the same.
The computational cost of calculating a file hash is considerably larger than the cost of reading a file size.
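A rough way to see the gap for yourself (the file name is hypothetical, and timings will vary with disk and file size): the size comes from a single metadata call, while the hash has to read every byte of the file.

```python
import hashlib
import os
import time

path = "sample.bin"  # hypothetical test file

t0 = time.perf_counter()
size = os.path.getsize(path)  # one stat call, no file contents read
t1 = time.perf_counter()

h = hashlib.md5()
with open(path, "rb") as f:   # hashing must read the whole file
    for chunk in iter(lambda: f.read(65536), b""):
        h.update(chunk)
t2 = time.perf_counter()

print(f"size: {size} bytes in {t1 - t0:.6f}s")
print(f"md5:  {h.hexdigest()} in {t2 - t1:.6f}s")
```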
Excuse my ignorance, but why all this discussion on MD5 collisions?
If during an investigation, you found two files with the same MD5 hash, wouldn't you validate that using SHA1, etc?
The issue is relevant if you are using MD5 to remove duplicate files