
deduplication software

(@gtorgersen)
Trusted Member
Joined: 17 years ago
Posts: 70
 

I think Paul said it perfectly. We overcomplicate things to the extreme. MD5 is a perfectly good hash algorithm. Using some other one is just a waste of CPU power. The likelihood of two files having the same MD5 is 2^128. Compound this with other factors, like the fact that the two files in question would both have to exist in the universe of documents you are testing, and the odds get even slimmer.


   
(@roncufley)
Estimable Member
Joined: 21 years ago
Posts: 161
 

If you want something cheap and effective, try this:

http://noclone.net


   
(@indur)
Trusted Member
Joined: 17 years ago
Posts: 67
 

Using SHA1 is not a waste of CPU power in this case, as computing hashes is limited by disk I/O and not by how much computation is required for the hash. (Even computing MD5, SHA1, SHA-256, and SHA-512 simultaneously should be I/O-bound.)
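To illustrate the point about computing several hashes at once, here is a minimal Python sketch that reads a file a single time and feeds each chunk to all four digests. The function name `multi_hash`, its parameters, and the 1 MiB chunk size are illustrative assumptions, not something taken from any tool mentioned in this thread.

```python
import hashlib


def multi_hash(path, algorithms=("md5", "sha1", "sha256", "sha512"),
               chunk_size=1 << 20):
    """Return {algorithm: hex digest} for one file, reading it only once.

    The name, the default algorithm list and the chunk size are
    illustrative choices for this sketch.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Every digest sees the same chunk, so the disk is read once
            # no matter how many algorithms are requested.
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Calling `multi_hash("some_file.bin")` (a hypothetical path) returns a dictionary with one hex digest per algorithm. Whether the run is actually I/O-bound depends on the disk and CPU in question, so it is worth timing on your own hardware.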


   
(@roncufley)
Estimable Member
Joined: 21 years ago
Posts: 161
 

"The likelihood of two files having the same MD5 is 2^128."

No. 1 in 2^128 is the chance that two particular files have the same MD5 hash, not the chance that any two files in a set do. If you have drawn an Ace, the chance of drawing a second Ace is not the same as the chance of drawing any pair from a deck.


   
(@bithead)
Noble Member
Joined: 20 years ago
Posts: 1206
 

The chance of two given files having the same MD5 hash value is 1/(2^128), roughly 1/(3.4 × 10^38), or about one in 340 billion billion billion billion.


   
(@indur)
Trusted Member
Joined: 17 years ago
Posts: 67
 

Yes, and the chance of a random MD5 collision in a set of N files is roughly N^2/2^129.
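As a rough worked example of that estimate, the Python sketch below evaluates N^2/2^129 exactly and compares it with the slightly finer birthday approximation 1 - exp(-N(N-1)/2^129). Both function names are made up for illustration. The formulas assume hashes behave like uniformly random 128-bit values and that N is far below 2^64, so they say nothing about deliberately constructed collisions.

```python
import math
from fractions import Fraction


def collision_chance_estimate(n_files, bits=128):
    """The rule of thumb from the post: N^2 / 2^(bits + 1), as an exact fraction."""
    return Fraction(n_files ** 2, 2 ** (bits + 1))


def collision_chance_birthday(n_files, bits=128):
    """Birthday approximation 1 - exp(-N(N-1) / 2^(bits + 1)).

    expm1 keeps precision when the probability is extremely small.
    """
    pairs_over_space = n_files * (n_files - 1) / 2 ** (bits + 1)
    return -math.expm1(-pairs_over_space)


if __name__ == "__main__":
    # One billion files: both figures come out on the order of 1e-21.
    n = 10 ** 9
    print(float(collision_chance_estimate(n)))
    print(collision_chance_birthday(n))
```

For any collection a real examiner is likely to see, the two values agree to many digits, which is why the simpler N^2/2^129 form is usually good enough.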


   
(@gtorgersen)
Trusted Member
Joined: 17 years ago
Posts: 70
 

Yes, sorry for the typo; I meant 1 in 2^128. My point is still that the chance of you actually hitting an MD5 collision is so remote that it is not worth worrying about. The courts have accepted it as a valid method of fingerprinting a file.


   
(@alanwo)
New Member
Joined: 16 years ago
Posts: 1
 

MD5 is not safe; see this example of MD5 collisions:
http://noclone.net/info/Trueduplicate.aspx


   
jaclaz
(@jaclaz)
Illustrious Member
Joined: 18 years ago
Posts: 5133
 

"MD5 is not safe; see this example of MD5 collisions:
http://noclone.net/info/Trueduplicate.aspx"

VERY clear sentence 🙄

Why MD5 is not reliable?

Some of the duplicate finding software in the market uncovers duplicate files by comparing MD5 hash string of file content. However it is not reliable that there is a chance of MD5 hash collision.

I like how the article expands on the computation of probabilities of an MD5 collision…. 😯

Unfortunately the actual "example" has no working links…. :(

But they are here ;)
http://noclone.net/info/hello.exe
http://noclone.net/info/erase.exe

It seems to me that the example is a "forged" one, i.e. something intentionally written, since FC /B shows only 6 bytes of difference:

C:\Downloaded\testnoclone>fc /B hello.exe erase.exe
Comparing files hello.exe and ERASE.EXE
00000953 09 89
0000096D 86 06
0000097B 91 11
00000993 28 A8
000009AD 54 D4
000009BB E8 68

…. the "scary string" is also present in the "harmless" hello.exe….
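A quick way to look closer at those six differences is to XOR each pair of bytes. The Python sketch below does that for the values copied from the fc /B listing above; `DIFFS`, `xor_masks` and `byte_diffs` are illustrative names, and `byte_diffs` is only a rough stand-in for what fc /B reports on equal-length inputs.

```python
# (offset, byte in hello.exe, byte in erase.exe), copied from the fc /B output
DIFFS = [
    (0x953, 0x09, 0x89),
    (0x96D, 0x86, 0x06),
    (0x97B, 0x91, 0x11),
    (0x993, 0x28, 0xA8),
    (0x9AD, 0x54, 0xD4),
    (0x9BB, 0xE8, 0x68),
]


def xor_masks(diffs):
    """Return (offset, a XOR b) for every differing byte pair."""
    return [(offset, a ^ b) for offset, a, b in diffs]


def byte_diffs(data_a, data_b):
    """List (offset, a, b) wherever two byte strings differ.

    zip() stops at the shorter input, so any trailing extra bytes are ignored.
    """
    return [(i, a, b) for i, (a, b) in enumerate(zip(data_a, data_b)) if a != b]


if __name__ == "__main__":
    for offset, mask in xor_masks(DIFFS):
        print(f"{offset:08X} {mask:02X}")  # the mask is 80 on every line
```

Every pair differs only in its most significant bit (XOR mask 0x80), which seems to fit the suspicion that the two files were constructed deliberately and are not unrelated programs that happen to share a hash.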

jaclaz


   
(@indur)
Trusted Member
Joined: 17 years ago
Posts: 67
 

Yes, "hello" and "erase" are intentionally produced collisions. There's a little "evilize" program that allows you to generate such executables.


   