SAME content DIFFERENT hash values - Explanation Plz

Computer forensics discussion. Please ensure that your post is not better suited to one of the forums below (if it is, please post it there instead!)
 
  

NeriMatrixx
Newbie
 

SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 10:23

I recently found out (through testing & reading) that files with the same content can generate different hashes. I tested this via copy & paste (to ensure same amount of whitespace) with .txt, docx, and pdf files.


cloudnine.com/ediscove...practices/


I was so confident in ... the odds of any two dissimilar/different files having the same MD5 hash is one in 2^128 (340 billion billion billion billion); and the odds of any two dissimilar/different files having the same SHA1 hash is one in 2^160.


I was taught that file headers, filename, etc were not calculated in the value, so my question is ... 1. How is this possible and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?  
 
  

Bunnysniper
Senior Member
 

Re: SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 12:17

Sorry, but you did not understand the article you mentioned. It says that you can have the same text in a txt, docx and PDF file and get different hashes. That is fine and true, because these files all have different headers, structures and meta information inside.

Think of a liter of water: you can put it into a glass, a bottle or a bath - it will always be 1 liter of water, but look very different.

But every time you hash a file and do not change it, you will always get the same hash. Otherwise hashing would be nonsense. What you mean might be the hash collision, but this occurs when different files generate the same hash. This is different from what you were writing about.
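A minimal sketch of both points, assuming Python's standard hashlib and zipfile modules. The ZIP archive here is only a stand-in for a container format (a .docx is a ZIP of XML parts), and the member name "document.txt" is made up for illustration:

```python
import hashlib
import io
import zipfile


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as hex."""
    return hashlib.md5(data).hexdigest()


text = b"the quick brown fox jumps over the lazy dog"

# "Container" 1: the bare bytes, as they would sit in a plain .txt file.
txt_bytes = text

# "Container" 2: the same text stored inside a ZIP archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:  # default is ZIP_STORED (no compression)
    zf.writestr("document.txt", text)
zip_bytes = buf.getvalue()

# The visible text is byte-for-byte present in both, yet the files differ.
print("text inside zip:", text in zip_bytes)
print("txt md5:", md5_hex(txt_bytes))
print("zip md5:", md5_hex(zip_bytes))

# Hashing the same unchanged bytes again always gives the same value.
print("repeatable:", md5_hex(txt_bytes) == md5_hex(txt_bytes))
```

The two digests differ because the archive adds headers and a directory around the text, while the repeated hash of the unchanged bytes is identical.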

I would suggest you read some articles and books about forensic basics.

regards, Robin
_________________
--
All opinions are mine and are not necessarily the opinions of my employer. 
 
  

Rich2005
Senior Member
 

Re: SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 15:03

The article you reference isn't talking about hash collisions.
It's talking about two completely different files having different hashes.
That's what you'd expect.
It doesn't matter if the text is the same.
I could write "the quick brown fox jumps over the lazy dog" on a piece of paper and then get someone else to do the same.
The TEXT would be the same but the document is different.
So, using the example from your link, creating a PDF from a Word Document is generating a completely different document.
It therefore would have a different hash.
The link you're referring to isn't wrong and I suspect neither is the material you're referring to. I've not seen DFIR/lecturers teaching nonsense (although that's possible obviously).

Hash collisions are something completely different.  
 
  

athulin
Senior Member
 

Re: SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 15:19

- NeriMatrixx
I was taught that file headers, filename, etc were not calculated in the value, so my question is ... 1. How is this possible and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?


If you were indeed taught that as a general truth (and not something that was true only in special cases, such as for EnCase hashing EnCase image files, etc.), you have teachers who don't understand what they are teaching. If that really is the case, it must be addressed.

But I would suggest starting at the other end. While it is difficult to assume that you are wrong, it is a useful approach, as you sooner or later have to convince someone that you are not committing a similar error towards them.

Your question suggests that your understanding of what 'file content' means may not be entirely in line with what is meant when the term hashing is used. That may be a good place to start: how do standard hashing tools -- and more particularly the one you used for your tests -- work, in detail?

Once you know how your hashing tool works, check the file content (at the same level as the hashing tool works on) of the files you've been doing your tests with.  
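As a rough illustration of that exercise, here is a sketch of how a typical file-hashing tool works, assuming it simply feeds every byte of the file to the digest in chunks. The helper names hash_file and first_bytes_hex, the file names, and the sample text are all hypothetical. The two test files would look the same in most editors but differ in their line endings:

```python
import hashlib
import os
import tempfile


def hash_file(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Hash every byte of the file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_bytes_hex(path: str, count: int = 16) -> str:
    """Show the start of the file at the level the hash actually sees."""
    with open(path, "rb") as fh:
        return fh.read(count).hex(" ")


with tempfile.TemporaryDirectory() as tmp:
    unix_path = os.path.join(tmp, "unix.txt")
    dos_path = os.path.join(tmp, "dos.txt")
    with open(unix_path, "wb") as fh:
        fh.write(b"same text\n")      # LF line ending
    with open(dos_path, "wb") as fh:
        fh.write(b"same text\r\n")    # CRLF line ending

    unix_hash, dos_hash = hash_file(unix_path), hash_file(dos_path)
    unix_hex, dos_hex = first_bytes_hex(unix_path), first_bytes_hex(dos_path)

print(unix_hex, "->", unix_hash)
print(dos_hex, "->", dos_hash)
```

The hex dumps differ by a single 0d byte, which is enough to produce a completely different digest.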
 
  

benfindlay
Senior Member
 

Re: SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 17:01

- NeriMatrixx
I was so confident in ... the odds of any two dissimilar/different files having the same MD5 hash is one in 2^128 (340 billion billion billion billion); and the odds of any two dissimilar/different files having the same SHA1 hash is one in 2^160.


No, the odds of an MD5 collision for 2 different files are, I believe, 1 in 2^64 and not 1 in 2^128, but a collision is still astronomically unlikely. This is because odds of collision and total number of combinations are NOT the same thing.

- NeriMatrixx
I was taught that file headers, filename, etc were not calculated in the value, so my question is ... 1. How is this possible and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?


To refer to Brian Carrier's reference model, the only data included in the hash calculation is that which is classified as being in the file's content category. Metadata like the filename and filesystem information like dates and times etc. are not a factor in the hash calculation. File headers (for clarity of terminology, by this we mean file signature/magic number) are because they are IN the file.

As alluded to by others, Word Docs etc. have other internal data present (like author details) which is not visible in the same manner as the file's textual content. This still classifies as "file content" in Carrier's model, but is perhaps more akin to being termed "embedded" or "internal" metadata.
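A small sketch of that distinction, assuming a tool that hashes only the bytes stored in the file. The file names, the sample payload and the helper md5_of_file are made up for illustration:

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str) -> str:
    """MD5 of the bytes stored in the file (its content category)."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


# Hypothetical payload that starts with a PDF-style signature.
payload = b"%PDF-1.4\nhypothetical sample body\n"

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "report.pdf")
    renamed = os.path.join(tmp, "renamed_copy.bin")
    stripped = os.path.join(tmp, "no_signature.pdf")

    for path, data in ((original, payload), (renamed, payload), (stripped, payload[8:])):
        with open(path, "wb") as fh:
            fh.write(data)

    # Change filesystem metadata only: access and modification times.
    os.utime(renamed, (1_000_000_000, 1_000_000_000))

    hash_original = md5_of_file(original)
    hash_renamed = md5_of_file(renamed)
    hash_stripped = md5_of_file(stripped)

print("different name and timestamps, same hash:", hash_original == hash_renamed)
print("signature removed, same hash:", hash_original == hash_stripped)
```

The name and timestamps live in the filesystem, so changing them leaves the hash alone. The signature lives inside the file, so removing it changes the hash.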

Hope this helps,

Ben
_________________
Ben Findlay. BSc (Hons) MSc PgCLTHE FHEA MBCS MCSFS MIScT MInstISP
Course Leader BSc Computer and Digital Forensics
School of Science, Engineering and Design
Teesside University 
 
  

NeriMatrixx
Newbie
 

Re: SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 23:38

- benfindlay

To refer to Brian Carrier's reference model, the only data included in the hash calculation is that which is classified as being in the file's content category. Metadata like the filename and filesystem information like dates and times etc. are not a factor in the hash calculation. File headers (for clarity of terminology, by this we mean file signature/magic number) are because they are IN the file.

As alluded to by others, Word Docs etc. have other internal data present (like author details) which is not visible in the same manner as the file's textual content. This still classifies as "file content" in Carrier's model, but is perhaps more akin to being termed "embedded" or "internal" metadata.

Hope this helps,

Ben



Oh, ok. It's the definition of File Content that was not communicated to us properly (a group of us from class had tested it). We were taught that file content is the visible text within the file. Once the file header and embedded data are used to calculate the hash value ... of course the hash will be different. I will update my notes with this.

THANKS A MILLION!!!  
 
  

tracedf
Senior Member
 

Re: SAME content DIFFERENT hash values - Explanation Plz

Post Posted: Oct 08, 19 23:46

- benfindlay


No, the odds of an MD5 collision for 2 different files are, I believe, 1 in 2^64 and not 1 in 2^128, but a collision is still astronomically unlikely. This is because odds of collision and total number of combinations are NOT the same thing.



The odds of two random files having the same MD5 hash are 1 in 2^128. Similarly, the odds of a file having the same hash as any particular file are 1 in 2^128. Finding any two files with the same hash, however, takes only on the order of 2^64 attempts. The difference in the latter circumstance is that if we are trying to find *any* collision rather than a specific one, we don't care which two files match so we can hash many different files and look for any collision between them. This is referred to as the birthday problem or the birthday paradox.

If you were to ask people their birthday, you would have to ask about 253 people (assuming they all answer you) before you had a better-than-even chance of finding someone with your birthday. But you'd only have to ask 23 people before it was more likely than not that two of them share a birthday. In the first instance, you're matching to a specific birthday and there is only a 1 in 365 chance each time you ask. In the second, you are comparing each birthday to every other birthday and don't care if Person-1 matches Person-2, or Person-2 matches Person-3, or Person-3 matches Person-1, etc.
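The two cases can be checked numerically. This is a minimal sketch assuming 365 equally likely birthdays and, for the hash case, an idealised 128-bit hash with uniformly distributed outputs. The function names are made up for illustration, and the last figure uses the standard birthday-bound approximation n ≈ sqrt(2 · ln 2 · N):

```python
import math


def p_match_specific(n: int, days: int = 365) -> float:
    """Chance that at least one of n people shares ONE given birthday."""
    return 1 - ((days - 1) / days) ** n


def p_any_shared(n: int, days: int = 365) -> float:
    """Chance that ANY two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct


# Matching a specific birthday: the even-odds point is at 253 people.
print("specific, n=253:", round(p_match_specific(253), 4))

# Any matching pair: the even-odds point is already at 23 people.
print("any pair, n=23: ", round(p_any_shared(23), 4))

# Same idea for a 128-bit hash: number of random inputs for ~50% chance
# of any collision, expressed as a power of two.
bits = 128
n_half = math.sqrt(2 * math.log(2) * 2**bits)
print("log2(inputs for ~50% collision):", round(math.log2(n_half), 2))
```

The last figure comes out slightly above 64, which is where the "~2^64" estimate for finding any MD5 collision by brute force comes from.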
 
