Greetings
I am looking for scholarly research regarding hash validity. I am looking to ascertain the validity of hash values regarding courtroom testimony for various hash functions. I am specifically refereeing to forensic use of hash functions for the identification or validation of files or strings. For example the hash validity of MD5 collision being one in ################.
Where ################ equates to N ~ 2^42 = 4.398.046.511.104 😯
Though this post is in Italian, it contains a large number of references
http//
Mainly (reyour question)
http//
http//
jaclaz
SHA1 (160 bit) probability that two files will be similar one in a quindecillion chance.
SHA1 (160 bit) probability that two files will be similar one in a quindecillion chance.
No. (
There is nothing like "similar" in hashing, it is pretty much binary, two hashes are the same (=1) or they are NOT (=0).
As a matter of fact, to have a collision, the files need NOT to be "similar", and it is very likely that they will be quite different.
The chances of finding a collision are actually between 2^69 (specific calculation) and 2^80 (random one), see previous linked to paper and
https://
SHA-1 produces a 160-bit hash. That is, every message hashes down to a 160-bit number. Given that there are an infinite number of messages that hash to each possible value, there are an infinite number of possible collisions. But because the number of possible hashes is so large, the odds of finding one by chance is negligibly small (one in 2^80, to be exact). If you hashed 2^80 random messages, you'd find one pair that hashed to the same value. That's the "brute force" way of finding collisions, and it depends solely on the length of the hash value. "Breaking" the hash function means being able to find collisions faster than that. And that's what the Chinese did.
They can find collisions in SHA-1 in 2^69 calculations, about 2,000 times faster than brute force.
See also (about the feasability of the attempt and related costs)
https://
But the point is …? ?
jaclaz
SHA1 (160 bit) probability that two files will be similar one in a quindecillion chance.
Good exanple, mainly because it raises so many questions.
It's not quite right. Given file A, the comparison file must be chosen at random – if your choice is predictable, so is the hash.
Also, the contents of the file has to be random – for much the same reason.
At that point the OP must ask are those criteria fulfilled? Or is he comparing file A with a file that isn't a random choice, but has been chosen by someone else? In that case, nothing can be said about likelihood of getting the same hash by using this kind of approach.
Add to that the fact that very few files have random contents. How does that affect the result? Much? Little?
For the right type of data, the probability of a collision in MD5 is 100%
http//
SHA-1 is also compromised https://
MD5 is properly, properly broken though.
Edit
Does this matter? Well the collisions are engineered, so for validating that some data which you have control of hasn't changed…it's not ideal, but it's not world ending. For checking the identity of previously known data - that is where that MD5 issue is dangerous. If you can hide malicious content in the data of known-good data… well, we just shouldn't be using MD5…
For the right type of data, the probability of a collision in MD5 is 100%
NO. cry
That is NOT the probability of a collision.
That is the fact that a collision can be artificially provoked through (a simple) calculation.
The calculation method found is simple enough to be within practical limits of *any* computer, hence MD5 is "compromised", but the probability of a collision on random files/data remains ~2^64, or more exactly 1.26*2^64 (but as seen in the already linked to sources can be lowered in a collision attack to around 2^42).
The SHA-1 as you can read in the links (also by Bruce Schneier) I posted earlier is NOT "compromised" as the calculation method found only reduces the number of brute force attempts from 2^80 (probability of a collision) to 2^69, which is however yet well beyond the possibilities of most computers
https://
Then, since log2(350) ~ 8.4 the cost of the attack will be approximately
213 * 28.4 = 221.4 ~ $2.77M in 2012
211 * 28.4 = 219.4 ~ $700K by 2015
29 * 28.4 = 217.4 ~ $173K by 2018
27 * 28.4 = 215.4 ~ $43K by 2021
A collision attack is therefore well within the range of what an organized crime syndicate can practically budget by 2018, and a university research project by 2021.
jaclaz
P.S. Edited to make more clear the difference between collision probability and range for a collision attack.
Except much of this is really for entertainment purposes and to annoy the opposing counsel (which is worthwhile in itself).
After all, the original post was about "forensic use of hash functions for the identification or validation of files or strings."
As so succinctly pointed out by jaclaz, "[…] to have a collision,[…] it is very likely that they will be quite different."
When I hash things, I also know not just the hash (as in cryptography), but the file type, MACD, location, ACL and all kind of other details. Each detail further increases the requirements and the exponent of the collision possibility.
Think about it. What are the chances of creating two sequences of bytes which are both valid Microsoft Word 2007 documents, both have matching MFT entries, etc? Slightly less complicated with raw bytes based data (such as text files), but even there the text file can only use certain ASCII, in a specific sequence, and so on.
Except much of this is really for entertainment purposes and to annoy the opposing counsel (which is worthwhile in itself).
For all we know, that may be the situation of the original poster to annoy or to be annoyed.
For all we know, that may be the situation of the original poster to annoy or to be annoyed.
If the second, you can leave it to me wink , I usually can manage to annoy someone in a very short time. D
For NO apparent reason mrgreen
http//
jaclaz