Could two seemingly identical PDF files contain different information?

General (Technical, Procedural, Software, Hardware etc.)

Last Post by ray m 4 years ago

16 Posts

3 Users

9 Reactions

5,562 Views

ray m

(@ray-m)

Eminent Member

Joined: 4 years ago

Posts: 18

Topic starter 04/07/2022 8:24 pm [#19625]

Hello everyone,

I have a question regarding the reliability of PDF files: Would it be possible to create two (or more) seemingly identical PDF files that actually contain different information?

What actually interests me is the question just how reliable hash values, metadata and byte-by-byte comparisons are when analyzing PDF files.

To explain my scepticism, imagine the following scenario: copies of an existing PDF file are edited so that each contains an encrypted section which, while encrypted, looks identical in all the copies but actually holds unique information. The differing information would of course only be visible after decryption. Furthermore, the encrypted section would not only look identical "from the outside" (when encrypted) but would also be added at exactly the same position in every copy, so that the edited "copies" seem identical, and the differing information inside the encrypted sections would be of the same length so that the file sizes do not differ. Also, the program used to edit the documents would be designed so that it does not change the (application) metadata.

If I received such "copies" without having access to the system they were created on, I could of course not look at the file system metadata to notice that they are in fact not identical. Also, the hash values would be identical for every file because the encrypted sections would conceal the differences. And additionally, a byte-by-byte comparison of the "copies" would not reveal any differences since, without decryption, they would seem identical.

Would such a scenario be possible? Or can I be confident that considering metadata, hash values and the results of byte-by-byte comparisons is enough to determine with certainty whether two files are identical or not?

Thanks in advance

ray m

Anonymous 6593

(@Anonymous 6593)

Joined: 18 years ago

Posts: 1158

05/07/2022 6:46 am

Without a definition of what 'seemingly' refers to, it's difficult to answer. Is it about size? Metadata? Order of contents? About what a PDF viewer shows you? Or about something else? For any type of 'seeming', it's probably possible to have some way in which it is glaringly obvious that there is a change.

You ask "can I be confident that considering metadata, hash values and the results of byte-by-byte comparisons is enough to determine with certainty whether two files are identical or not".

Under a suitable definition of 'file', you can be certain. (Consider the NTFS definition of file, which includes alternate data streams.)

But ... you can't be certain that the interpretation of that data will be the same everywhere.

What you describe is (as far as I can judge) about implicit or indirect data, not present in the document, affecting interpretation. Your variously encrypted streams clearly require some information that says just what encryption method and parameters should be used for decryption. If that data is in the document, it affects hashing; if it is outside, stored in a file or in a registry or in a DLL file, hashing the document file won't help you identify it.

You can have a similar situation with Word documents. In their intended environment, say at company BestAcme, all Word installations refer to BestAcme corporate libraries of Word Basic code that provide additional functionality. Outside it (such as with a forensic analyst) that code is not present, and so the reader may not see it. (I've run into that once -- the company helpdesk said I should look at the info in a window, but as I was outside their company IT environment, it just wasn't there. I had to get a corporate laptop to get at that info.)

Or ... more obvious, documents referring to something like the web site www.example.com. Normally, that leads to a dead end, but if your local DNS setup resolves www.example.com to your secret contraband file storage ...

Or, to take a less obvious example: ISO images. The ISO 9660 standard represents file offsets by two integers, one in little-endian format, the other in big-endian format. If the receiving system doesn't verify that they are the same, you can have "readme.txt" resolve to one text when you use a big-endian 'mounter' and to another text when you use a little-endian 'mounter'.
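To make the both-endian point concrete, here is a minimal Python sketch. It only models a single 8-byte both-endian field (a 32-bit value stored little-endian, then big-endian); the helper names are made up for illustration and it does not parse a real ISO image.

```python
import struct
from typing import Tuple


def pack_both_endian_u32(little: int, big: int) -> bytes:
    """Build an 8-byte both-endian field. A well-formed image stores the same value twice."""
    return struct.pack("<I", little) + struct.pack(">I", big)


def read_both_endian_u32(field: bytes, strict: bool = True) -> Tuple[int, int]:
    """Return (little_endian_value, big_endian_value).

    In strict mode, refuse a field whose two halves disagree.
    """
    if len(field) != 8:
        raise ValueError("expected an 8-byte both-endian field")
    (le,) = struct.unpack("<I", field[:4])
    (be,) = struct.unpack(">I", field[4:])
    if strict and le != be:
        raise ValueError(f"inconsistent both-endian field: LE={le}, BE={be}")
    return le, be


honest = pack_both_endian_u32(23, 23)
crafted = pack_both_endian_u32(23, 57)  # one name, two different offsets

print(read_both_endian_u32(honest))                 # (23, 23)
print(read_both_endian_u32(crafted, strict=False))  # (23, 57)
```

A reader that only ever looks at one half of the field never notices the crafted case; a strict reader rejects it. Either way, the image itself is one fixed sequence of bytes with one hash.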

You have to keep 'meaning', or whatever you refer to by the word 'seeming', separate from 'content'. And you also need to know if your definition of 'content' is the same as that used by your byte-by-byte comparer or file hasher ... or live imager.

ray m and mattb90 reacted

ray m

(@ray-m)

Eminent Member

Joined: 4 years ago

Posts: 18

Topic starter 05/07/2022 8:28 pm

@athulin

Thank you very much for the reply!

I would like to keep the definition of "seemingly identical files" rather broad: files that contain hidden, unique information whose existence cannot be accessed, or even detected, by anyone who was not given specific tools and knowledge (e.g. a decryption key) by the person or organization that hid it.

Since this definition is pretty much tautological (sorry) I'll give an example of what I have in mind:

Someone intends to send hidden messages to different people. Every person should receive a unique message, but the PDF file that each person receives should appear to be the same for all of them. The files' contents could appear to be just an advertisement brochure when opened with a PDF reader. Every page of the PDF file would look identical to the human eye, every line of code would look identical when the file is opened raw, and hash values, metadata etc. would be identical as well for all the copies (in case someone unauthorized got access to those "copies" and started to look for hidden communication).

Your Word document anecdote sounds very interesting! Was there really no trace of the file's additional information outside of the company network or was it "just" impossible to access it? What interests me is: would adding/altering such information, although it's not accessible/visible, affect the hash value for example?

C.R.S.

(@c-r-s)

Estimable Member

Joined: 15 years ago

Posts: 170

05/07/2022 11:59 pm

Hi,

Byte-by-byte comparison is foolproof. Like anything in the world of computers, a file is a large number. If the two numbers are mathematically equal, the files are the same.

You can use a hash algorithm to reduce the files to smaller numbers and compare those. Due to the reduction, there is an inevitable probability of hash collisions. However, the natural probability that the hashed files are not the same, if the hash values are the same, is extremely low.
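As a rough illustration of both checks, the following Python sketch compares two files byte by byte and by SHA-256. The two temporary files are invented for the demo: they have the same length and differ only in a 16-byte block standing in for an "encrypted section" with different contents.

```python
import hashlib
import os
import tempfile

CHUNK = 64 * 1024


def same_bytes(path_a: str, path_b: str) -> bool:
    """Byte-by-byte comparison: True only if both files hold exactly the same bytes."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                return False
            if not a:  # both files ended at the same point
                return True


def sha256_of(path: str) -> str:
    """Hash the file contents in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as d:
    path_a = os.path.join(d, "a.pdf")
    path_b = os.path.join(d, "b.pdf")
    body = b"%PDF-1.7\n" + b"x" * 100
    with open(path_a, "wb") as f:
        f.write(body + bytes(16) + b"\n%%EOF\n")       # 16 "ciphertext" bytes
    with open(path_b, "wb") as f:
        f.write(body + bytes([1]) * 16 + b"\n%%EOF\n")  # same length, other bytes
    equal = same_bytes(path_a, path_b)
    hashes_equal = sha256_of(path_a) == sha256_of(path_b)

print(equal, hashes_equal)
```

Different plaintexts encrypted under the same key give different ciphertext bytes, so an encrypted section does not conceal the difference from either check; it only conceals what the difference means.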

If you have a sufficiently efficient way to calculate a type of hash, or the algorithm is otherwise broken, and your target file type (like PDF) tolerates padding, you can determine contents to pad one of two different files, so that the algorithm generates the same hash value for the two files. If the file type does not tolerate padding, it is obviously even more difficult to compute two different functional and intended file contents that result in the same hash value. However, the files are verifiably different in both cases.
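The padding idea can be tried out on a deliberately weak hash. The sketch below is an assumption-laden toy: `weak_hash` is just the first two bytes of SHA-256 (so only 65,536 possible values), standing in for a "broken" algorithm, and the file contents are placeholders rather than valid PDFs.

```python
import hashlib


def weak_hash(data: bytes) -> bytes:
    """Deliberately weak stand-in for a broken hash: first 2 bytes of SHA-256."""
    return hashlib.sha256(data).digest()[:2]


def pad_to_collide(target: bytes, other: bytes, max_tries: int = 1 << 22) -> bytes:
    """Append trailing padding to `other` until its weak hash equals that of `target`."""
    goal = weak_hash(target)
    for counter in range(max_tries):
        candidate = other + b"\n%pad " + str(counter).encode("ascii")
        if weak_hash(candidate) == goal:
            return candidate
    raise RuntimeError("no collision found within max_tries")


file_a = b"%PDF-1.7 message for recipient A %%EOF"
file_b = b"%PDF-1.7 message for recipient B %%EOF"
padded_b = pad_to_collide(file_a, file_b)

print(weak_hash(file_a) == weak_hash(padded_b))  # weak hashes now agree
print(file_a == padded_b)                        # the bytes still differ
```

Against a full-strength hash this brute-force search is hopeless, which is the point of the paragraph above; and even where it succeeds, a byte-by-byte comparison or a second, unbroken hash still tells the two files apart.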

If you want to send actually the same file to different people and achieve different computer outputs, you need to leverage a context that is specific to the recipients. You can use file features, like the fact that a printed PDF output can be different from its screen display, if one person tends to print the PDF and the other one to examine it on screen. Or use time zones or other available system properties to distinguish between recipients. However, such techniques will be more or less apparent to anyone who has access to the file.

Or you can use encryption algorithms and keys as cryptographic contexts: In the simplest case, you embed a concatenated series of messages, each element of which is encrypted with a different key. If each recipient only holds one of the keys, they can decrypt only the element that is intended for them. Of course, there can be data integrity issues here, but it can also be made as complex as you like to solve such problems, using nested encryption, asymmetric encryption/signatures etc.
If the tell-tale high entropy of encrypted data needs to be avoided, more traditional "spy communication" could be used, i.e. different decoding patterns for each recipient can be applied to the natural PDF contents (text, color codes etc.).
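A minimal sketch of the "concatenated series of messages" variant, in Python. Everything here is an illustrative assumption: the cipher is a toy keystream built from SHA-256 (not secure, and it reuses one key for encryption and for the HMAC tag), and the slot layout and function names are invented. A real design would use a vetted authenticated cipher.

```python
import hashlib
import hmac
from typing import List, Optional, Tuple

SLOT = 32   # every message is padded to this length so all slots look alike
TAG = 32    # length of an HMAC-SHA256 tag


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter. Not secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, index: int, message: bytes) -> bytes:
    """Encrypt one message into a fixed-size slot and append an integrity tag."""
    if len(message) > SLOT:
        raise ValueError("message too long for slot")
    nonce = index.to_bytes(4, "big")
    ciphertext = xor(message.ljust(SLOT, b"\0"), keystream(key, nonce, SLOT))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return ciphertext + tag


def build_payload(entries: List[Tuple[bytes, bytes]]) -> bytes:
    """Concatenate one sealed slot per (key, message) pair."""
    return b"".join(seal(k, i, m) for i, (k, m) in enumerate(entries))


def open_slot(key: bytes, payload: bytes) -> Optional[bytes]:
    """Try every slot; return the message whose tag verifies under `key`."""
    size = SLOT + TAG
    for i in range(len(payload) // size):
        block = payload[i * size:(i + 1) * size]
        ciphertext, tag = block[:SLOT], block[SLOT:]
        nonce = i.to_bytes(4, "big")
        expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
        if hmac.compare_digest(tag, expected):
            return xor(ciphertext, keystream(key, nonce, SLOT)).rstrip(b"\0")
    return None


key_a, key_b = b"key-of-recipient-a", b"key-of-recipient-b"
payload = build_payload([(key_a, b"meet at noon"), (key_b, b"stay home")])

print(open_slot(key_a, payload))
print(open_slot(key_b, payload))
```

Every recipient gets the same `payload` bytes, so the file hash and a byte-by-byte comparison agree across all copies; what differs is the key each recipient holds, which is exactly the context that lives outside the file.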

ray m reacted

Anonymous 6593

(@Anonymous 6593)

Joined: 18 years ago

Posts: 1158

06/07/2022 6:39 am

@ray-m "Someone intends to send hidden messages to different people. "

You could probably do that with PDF, as you can write (limited) code to execute on read or other actions. That code could have an 'if this is recipient 1 show this, if it is recipient 2 do that' ... but in normal PDF that would be visible. By encrypting the code (or the stream in which the code appears) you can 'hide' it, but then you have a pretty big red flag: this PDF contains encrypted content that needs a user password to decrypt. That could trigger an examination of local PDF key stores (for automatic decryption) or lead to the user being asked for the password. You might do something clever with a PDF extension, but it would still be visible. Perhaps not to a naive user who never looks inside a PDF file, but to a forensic analyst specializing in PDFs, it would be pretty plain. (I'm not one, but I have followed PDF since it was introduced, so I think I have a pretty good idea of the general capabilities ... or I like to think that I do ...).

The best way to approach this particular area is to get the Adobe PDF Reference Manual and read it. The latest version is an ISO standard (and so requires payment), but the 1.7 release is available for free at archive.org ( https://archive.org/details/pdf1.7). For full details you may have to sign up as an Adobe PDF developer or something like that.

"Your Word document anecdote sound very interesting! Was there really no trace of the file's additional information outside of the company network or was it "just" impossible to access it?"

No, there were traces. But you had to know about them beforehand, and for that you needed to take the .doc container apart. I'm sure that there are tools out there that detect and report that kind of stuff ... so to some it would have been glaringly obvious, and to anyone who fingerprinted the document as something produced by this document management system, the metadata was fairly easy to extract. In this case, it was part of the .doc container, so altering it would have changed the hash. However, if the data had been stored in an NTFS alternate data stream, it wouldn't, unless you used a hashing tool that hashed everything that's part of what NTFS calls a file.
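The difference between the two kinds of hashing tool can be shown with a toy model. In this Python sketch a "file" is just a dict mapping stream names to bytes, with the empty name standing in for the default data stream; it does not touch NTFS at all, and the stream name and contents are invented.

```python
import hashlib
from typing import Dict


def hash_default_stream(streams: Dict[str, bytes]) -> str:
    """What a typical hashing tool sees: only the default (unnamed) data stream."""
    return hashlib.sha256(streams[""]).hexdigest()


def hash_all_streams(streams: Dict[str, bytes]) -> str:
    """Hash every stream, with names and lengths, in a fixed order."""
    h = hashlib.sha256()
    for name in sorted(streams):
        name_bytes = name.encode("utf-8")
        data = streams[name]
        h.update(len(name_bytes).to_bytes(4, "big"))
        h.update(name_bytes)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


plain = {"": b"quarterly report"}
tagged = {"": b"quarterly report", "note": b"copy for recipient 7"}

print(hash_default_stream(plain) == hash_default_stream(tagged))  # True
print(hash_all_streams(plain) == hash_all_streams(tagged))        # False
```

Which of the two answers is "right" depends on the definition of 'file' you chose, which is why it matters to know what your hasher actually reads.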

ray m reacted

ray m

(@ray-m)

Eminent Member

Joined: 4 years ago

Posts: 18

Topic starter 07/07/2022 7:06 pm

Thanks for the replies, @c-r-s and @athulin! They were very helpful!

So, a byte-by-byte comparison of two seemingly identical files should suffice to detect the existence of any differences in pieces of information (including file content, hidden communication, encrypted sections, unique identifiers that are added to the file's (non-system-)metadata, ...) that would be transferred along with (or rather: as part of) those files when they are being copied/sent from one system to another. Is that correct?

The only exception to this appears to be information that could be included inside the alternate data streams that @athulin mentioned. After reading athulin's very interesting post I looked for more information about ADS and found that not only could files be hidden inside those streams but with some effort apparently also the streams themselves could be hidden from Windows, antivirus software and forensic tools.

However, ADS can only be saved on NTFS and are therefore lost when they are not directly transferred from one NTFS device to another. There appear to be workarounds for this issue but simply sending a file via email would cause the ADS to be lost. At least this was the case when I just tried it. Was this because uploading/downloading an email attachment does not "mimic" a direct NTFS-to-NTFS transfer or was it just an issue specific to my system or email provider?

Anonymous 6593

(@Anonymous 6593)

Joined: 18 years ago

Posts: 1158

08/07/2022 6:33 am

@ray-m "... simply sending a file via email would cause the ADS to be lost. At least this was the case when I just tried it. Was this because uploading/downloading an email attachment does not "mimic" a direct NTFS-to-NTFS transfer or was it just an issue specific to my system or email provider?"

It was probably not specific to your system. Your system, after all, allows application software to identify and access ADSes: that's built into Windows (the kernel, not the GUI Shell.)

And email providers just receive email and pass it on, as long as it is well-formatted.

But ADSes are not really intended to be user-level resources: they are more application- or system-level (the ADS added by most web browsers to downloaded files is a system-level resource), and so not necessarily of any interest to automatically include in emails. (They might contain personal information, for one.) So it's more up to the email software you use (directly or indirectly): how does it handle files with ADSes?

Probably not at all -- as there is no support for ADS in standard mail protocols, and it doesn't know if there are any common extensions that the *receiving* email software can use, and so it is likely to be more a source of confusion and complaint than of utility. (Never a good idea to build sources of confusion into your software.) Instead, the user will have to use an ADS-capable file archive program to create an archive file, and then send that as an attachment. Some email software may allow you to select a file archive program and provide command-line options or equivalents for use when you drag files into the mail as attachments. Of course, the recipient has to have the same software at his end.

You have a similar situation with Macs (or has it been dropped?), where files had one data fork and one resource fork. If you wanted to transfer both those, a fork-capable archiver such as Stuffit! was required at both ends of the transfer. (See https://en.wikipedia.org/wiki/Resource_fork#Compatibility_problems)

ray m reacted

C.R.S.

(@c-r-s)

Estimable Member

Joined: 15 years ago

Posts: 170

09/07/2022 6:54 pm

The meanings vary between descriptions of different technical processes, but generally a file should not be confused with its contextualized file system representations. IMHO a file is the "payload" bytestream of a file system record that is returned to an application for a read operation.

An NTFS feature allows storing multiple such payloads for a single file system record - the primary data stream and the alternate data streams. Everything around those data streams, or files, is file system metadata.

There is no need to store files in file systems, it's just convenience. You can create a PDF processor (and run it with elevated privileges to bypass the OS restrictions) that writes files to defined LBA blocks on blank disks. However, to make that repeatable, you will need to define rules for selecting the blocks, which then turn into their own file system. Using their own file systems on unpartitioned space is exactly how some malware can exfiltrate data via storage devices.

If a file system resides on a storage device (or on a partition, to which file systems restrict themselves in order to allow concurrent use), its features provide specific options to embed information more or less openly. These channels are only effective if the file system itself, on the storage device or on block-level copies, is transferred to the recipients of the information.

Just like a comprehensive functional examination of a file will uncover any techniques to hide information in the file, and the altered file will be bytewise different from the unaltered one, the same applies to the file system.

ray m reacted

ray m

(@ray-m)

Eminent Member

Joined: 4 years ago

Posts: 18

Topic starter 09/07/2022 9:31 pm

Thank you very much, @athulin and @c-r-s!

What C.R.S. describes is very interesting and new to me. I'm not sure if I fully understand it and would like to ask you to answer some questions I have regarding that matter.

1. Creating a file system on unpartitioned space as C.R.S. describes appears to be a way to store information which is associated with a file outside of that file itself - basically as file system data. Is that correct?

2. The creation of a PDF processor which works in the way C.R.S. describes would be necessary to initiate the process of storing that information on the unpartitioned space, right?

3. Would it be possible that the process of saving a file which is sent as an email attachment, could include unwittingly transferring also the file system (maybe as a virtualized storage device or block-level copies) on which the additional information is stored to one's own machine?

4. If I understand you correctly, it would be possible that a PDF file (which has been altered by a PDF processor in the manner you described) could, after being sent to another device, create its own file system "around itself" like malware and store information on that file system. Or would it be necessary to have the described PDF processor installed and working on the device on which the file system is supposed to be created / transferred to?

Thanks in advance! You have already been of great help thus far.

C.R.S.

(@c-r-s)

Estimable Member

Joined: 15 years ago

Posts: 170

09/07/2022 10:08 pm

Posted by: @ray-m

1. Creating a file system on unpartitioned space as C.R.S. describes appears to be a way to store information which is associated with a file outside of that file itself - basically as file system data. Is that correct?

It depends on what you mean by "associated". You started your questions with the problem of transmitting information to different people in files. No file system feature is associated with files in this sense. Imagine how bad that would be: You created a PDF years ago under Linux on an ext2 partition and want to read it now under Windows 10 from a ReFS storage space. Fortunately, the file will not bear any traces of which file systems it was stored on during its life, and only the reader application must be compatible.
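A small Python sketch of that separation between file contents and file system metadata. The file name, contents and timestamp are placeholders chosen for the demo; the copy is given a different name and a different modification time, yet the content hash stays the same.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path: str) -> str:
    """Hash only the bytes an application gets back when it reads the file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "brochure.pdf")
    with open(original, "wb") as f:
        f.write(b"%PDF-1.7\nplaceholder body\n%%EOF\n")

    renamed_copy = os.path.join(d, "renamed-copy.pdf")
    shutil.copyfile(original, renamed_copy)  # copies contents only
    old_time = 1_000_000_000                 # an arbitrary timestamp in 2001
    os.utime(renamed_copy, (old_time, old_time))

    same_content = sha256_of(original) == sha256_of(renamed_copy)
    same_mtime = os.path.getmtime(original) == os.path.getmtime(renamed_copy)

print(same_content, same_mtime)
```

Name and timestamps belong to the file system record, not to the byte stream, so they neither travel inside the file nor influence its hash.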

I wanted to highlight that a file system is just a specification of how to store files on media. You can apply it to any media context: A "friendly" file system will respect the concept of partitions and be created within those, but you can create rules for how to store data outside partitions as well. A modern production file system has a ton of features that require it to store file system metadata; a simplistic and clandestine way of storing information can as well be based on rules that do not require stored file system metadata. Another option, typically used by malware, is to apply an overlay file system on top of the regular file system that is found on the machine. Then the stored data structure is not evident from accessing just the original file system. You are free to choose the storage context (what a partition is for a regular file system) when developing an overlay file system: It can be files or any file system metadata that the intended host file systems support or a combination.

ray m reacted
