Could two seemingly identical PDF files contain different information?

ray m
(@ray-m)
New Member

Hello everyone,

I have a question regarding the reliability of PDF files: Would it be possible to create two (or more) seemingly identical PDF files that actually contain different information?

What actually interests me is the question just how reliable hash values, metadata and byte-by-byte comparisons are when analyzing PDF files.

To explain my scepticism I would like you to imagine a scenario like the following one: copies of an already existing PDF file would be edited in such a way that they all contain an encrypted section which - when encrypted - looks identical in all the copies but could actually contain unique information. The differing information would of course only be visible after decryption. Furthermore, the encrypted section would not only look identical "from the outside" (when encrypted) but would also be added to the exact same part of the document in every copy to ensure the edited "copies" seem identical, and the differing information inside the encrypted sections would be of the same length to not create files with different sizes. Also, the program used to edit the documents would be designed in such a way that it would not change the (application) metadata.

If I received such "copies" without having access to the system they were created on I could of course not look at the file system metadata to notice that they are in fact not identical. Also, the hash values would be identical in every file because the encrypted sections would conceal the differences. And additionally, a byte-by-byte comparison of the "copies" would not reveal any differences since - without decryption - they would seem identical.

Would such a scenario be possible? Or can I be confident that considering metadata, hash values and the results of byte-by-byte comparisons is enough to determine with certainty whether two files are identical or not?

 

Thanks in advance

ray m

This topic was modified 1 month ago by ray m
Topic starter Posted : 04/07/2022 8:24 pm
athulin
(@athulin)
Community Legend

Without a definition of what 'seemingly' refers to it's difficult to answer. Is it about size? Metadata? Order of contents? About what a PDF viewer shows you? Or about something else? For any type of 'seeming' it's probably possible to have some way in which it is glaringly obvious that there is a change.

You ask "can I be confident that considering metadata, hash values and the results of byte-by-byte comparisons is enough to determine with certainty whether two files are identical or not". 

Under a suitable definition of 'file', you can be certain. (Consider the NTFS definition of file, which includes alternate data streams.)

But ... you can't be certain that the interpretation of that data will be the same everywhere.

What you describe is (as far as I can judge) about implicit or indirect data, not present in the document, affecting interpretation. Your variously encrypted streams clearly require some information that says just what encryption method and parameters should be used for decryption. If that data is in the document, it affects hashing; if it is outside, stored in a file or in a registry or in a DLL file, hashing the document file won't help you identify it.

You can have a similar situation with Word documents.  In their intended environment, say at company BestAcme, all Word installations refer to BestAcme corporate libraries of Word Basic code that provide additional functionality. Outside it (such as with a forensic analyst) that code is not present, and so the reader may not see it. (I've run into that once -- the company helpdesk said I should look at the info in a window, but as I was outside their company IT environment, it just wasn't there. I had to get a corporate laptop to get at that info.)

Or ... more obvious, documents referring to something like the web site www.example.com.  Normally, that leads to a dead end, but if your local DNS setup resolves www.example.com to your secret contraband file storage ... .

Or, to take a less obvious example: ISO images.  The ISO 9660 standard represents file offsets by two integers, one in little-endian format, the other in big-endian format. If the receiving system doesn't verify that they are the same, you can have "readme.txt" resolve to one text when you use a big-endian 'mounter'  and to another text when you use a little-endian 'mounter'.
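
To make the ISO example a bit more concrete, here is a minimal Python sketch. It assumes an 8-byte "both-endian" field that holds the same 32-bit value twice, little-endian first and big-endian second, which is how ISO 9660 lays out such numeric fields. The function names and the values 23 and 99 are made up for illustration; this is not a parser for real images.

```python
import struct


def pack_both_endian(value):
    """Encode a 32-bit value little-endian first, then big-endian."""
    return struct.pack("<I", value) + struct.pack(">I", value)


def read_both_endian(field):
    """Return (little-endian reading, big-endian reading) of an 8-byte field."""
    le = struct.unpack("<I", field[:4])[0]
    be = struct.unpack(">I", field[4:8])[0]
    return le, be


def is_consistent(field):
    """A careful reader checks that both halves agree before trusting either."""
    le, be = read_both_endian(field)
    return le == be


# A well-formed field: both halves say 23.
honest = pack_both_endian(23)
# A crafted field: the little-endian half says 23, the big-endian half says 99.
crafted = struct.pack("<I", 23) + struct.pack(">I", 99)
```

A tool that only reads the first half and a tool that only reads the second half would be looking at the very same bytes, with the same hash, and still end up in different places.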

You have to keep 'meaning', or whatever you refer to by the word 'seeming', separate from 'content'. And you also need to know if your definition of 'content' is the same as that used by your byte-by-byte comparer or file hasher ... or live imager. 

Posted : 05/07/2022 6:46 am
ray m and mattb90 liked
ray m
(@ray-m)
New Member

@athulin 

Thank you very much for the reply!

I would like to keep the definition of "seemingly identical files" rather broad: files that contain hidden, unique information, which is impossible to access or even get a trace of its existence as someone who wasn't provided with specific tools and knowledge (e.g. a decryption key) to do so from the person or organization that hid that information.

Since this definition is pretty much tautological (sorry) I'll give an example of what I have in mind:

Someone intends to send hidden messages to different people. Every person should receive a unique message, but the PDF file that each and every person receives should appear to be the same for all of them. The files' contents could appear to be just an advertisement brochure when opened with a PDF reader. Every page of the PDF file would look identical to the human eye, every line of code when opened raw would look identical, hash values, metadata etc. would be identical as well for all the copies (in case someone unauthorized would get access to those "copies" and started to look for hidden communication).

Your Word document anecdote sounds very interesting! Was there really no trace of the file's additional information outside of the company network or was it "just" impossible to access it? What interests me is: would adding/altering such information, although it's not accessible/visible, affect the hash value for example?

This post was modified 1 month ago by ray m
Topic starter Posted : 05/07/2022 8:28 pm
C.R.S.
(@c-r-s)
Active Member

Hi,

Byte-by-byte comparison is foolproof. Like anything in the world of computers, a file is a large number. If the two numbers are mathematically equal, the files are the same.

You can use a hash algorithm to reduce the files to smaller numbers and compare those. Due to the reduction, there is an inevitable probability of hash collisions. However, the natural probability that the hashed files are not the same, if the hash values are the same, is extremely low.

If you have a sufficiently efficient way to calculate a type of hash, or the algorithm is otherwise broken, and your target file type (like PDF) tolerates padding, you can determine contents to pad one of two different files, so that the algorithm generates the same hash value for the two files. If the file type does not tolerate padding, it is obviously even more difficult to compute two different functional and intended file contents that result in the same hash value. However, the files are verifiably different in both cases.
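
As a rough illustration of the comparison and hashing described above, here is a small Python sketch. The two byte strings are invented stand-ins for files (they are not valid PDFs), and SHA-256 is just one possible choice of hash.

```python
import hashlib


def same_bytes(a, b):
    """Byte-by-byte comparison: equal only if every byte matches."""
    return a == b


def sha256_hex(data):
    """Reduce the data to a short fixed-size value for comparison."""
    return hashlib.sha256(data).hexdigest()


original = b"%PDF-1.7\n(brochure text)\n%%EOF\n"
# Same length as the original, but one byte differs ("e" became "E").
altered = b"%PDF-1.7\n(brochure tExt)\n%%EOF\n"
```

Here `same_bytes(original, altered)` is false even though both have the same size, and the two SHA-256 values differ as well. For files on disk, Python's `filecmp.cmp(path_a, path_b, shallow=False)` does the equivalent content comparison.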

If you actually want to send the same file to different people and achieve different computer outputs, you need to leverage a context that is specific to the recipients. You can use file features, like the fact that a printed PDF output can be different from its screen display, if one person tends to print the PDF and the other one to examine it on screen. Or use time zones or other available system properties to distinguish between recipients. However, such techniques will be more or less apparent to anyone who has access to the file.

Or you can use encryption algorithms and keys as cryptographic contexts: In the simplest case, you embed a concatenated series of messages, each element of which is encrypted with a different key. If each recipient only holds one of the keys, they can decrypt only the element that is intended for them. Of course, there can be data integrity issues here, but it can also be made as complex as you like to solve such problems, using nested encryption, asymmetric encryption/signatures etc.
If the tell-tale high entropy of encrypted data needs to be avoided, more traditional "spy communication" could be used, i.e. different decoding patterns for each recipient can be applied to the natural PDF contents (text, color codes etc.).
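
A toy Python sketch of the simplest case (a series of messages, each encrypted with a different key) might look like the following. Everything here is an assumption made for illustration: the XOR keystream derived from SHAKE-256 is not a vetted cipher, reusing one key for both encryption and the integrity tag is poor practice, and the function names and messages are invented. It only shows the principle that every recipient gets the same bytes but can open only their own element.

```python
import hashlib
import hmac
import os


def _keystream(key, nonce, length):
    # Toy keystream for illustration only, not a reviewed cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def seal(key, message):
    """Encrypt one message and append a tag so the right key can recognise it."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(message))
    ciphertext = bytes(m ^ k for m, k in zip(message, stream))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def try_open(key, blob):
    """Return the plaintext if this key fits the element, otherwise None."""
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, stream))


def read_my_message(key, elements):
    """Try every element; only the one sealed with this key will open."""
    for blob in elements:
        message = try_open(key, blob)
        if message is not None:
            return message
    return None


key_a, key_b = os.urandom(32), os.urandom(32)
# The same list of elements would be embedded in every copy of the file.
payload = [seal(key_a, b"meet at noon"), seal(key_b, b"stay at home")]
```

The holder of `key_a` gets `b"meet at noon"`, the holder of `key_b` gets `b"stay at home"`, and someone with neither key gets nothing, although all three hold byte-identical data.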

Posted : 05/07/2022 11:59 pm
ray m liked
athulin
(@athulin)
Community Legend

@ray-m "Someone intends to send hidden messages to different people. "

You could probably do that with PDF, as you can write (limited) code to execute on read or other actions.  That code could have an 'if this is recipient 1 show this, if it is recipient 2 do that' ... but in normal PDF that would be visible.  By encrypting the code (or the stream in which the code appears) you can 'hide' it, but then you have a pretty big red flag: this PDF contains encrypted contents that need a user password to decrypt. That could trigger an examination of local PDF key stores (for automatic decryption) or lead to the user being asked for the password. You might do something clever with a PDF extension, but it would still be visible.  Perhaps not to a naive user who never looks inside a PDF file, but to a forensic analyst specializing in PDFs, it would be pretty plain. (I'm not one, but I have followed PDF since it was introduced, so I think I have a pretty good idea of the general capabilities ... or I like to think that I do ...).

The best way to approach this particular area is to get the Adobe PDF Reference Manual and read it. The latest version is an ISO standard (and so requires payment), but the 1.7 release is available for free at archive.org ( https://archive.org/details/pdf1.7). For full details you may have to sign up as an Adobe PDF developer or something like that.

"Your Word document anecdote sound very interesting! Was there really no trace of the file's additional information outside of the company network or was it "just" impossible to access it?"

No, there were traces.  But you had to know about them beforehand, and for that you needed to take the .doc container apart. I'm sure that there are tools out there that detect and report that kind of stuff ... so to some it would have been glaringly obvious, and to anyone who fingerprinted the document as something produced by this document management system, the metadata was fairly easy to extract. In this case, it was part of the .doc container, so altering it would have changed the hash.  However, if the data had been stored in an NTFS alternate data stream, it wouldn't, unless you used a hashing tool that hashed everything that's part of what NTFS calls a file.

 

Posted : 06/07/2022 6:39 am
ray m liked
ray m
(@ray-m)
New Member

Thanks for the replies, @c-r-s and @athulin! They were very helpful!

So, a byte-by-byte comparison of two seemingly identical files should suffice to detect the existence of any differences in pieces of information (including file content, hidden communication, encrypted sections, unique identifiers that are added to the file's (non-system-)metadata, ...) that would be transferred along with (or rather: as part of) those files when they are being copied/sent from one system to another. Is that correct?

The only exception to this appears to be information that could be included inside the alternate data streams that @athulin mentioned. After reading athulin's very interesting post I looked for more information about ADS and found that not only could files be hidden inside those streams but with some effort apparently also the streams themselves could be hidden from Windows, antivirus software and forensic tools.

However, ADS can only be saved on NTFS and are therefore lost when they are not directly transferred from one NTFS device to another. There appear to be workarounds for this issue but simply sending a file via email would cause the ADS to be lost. At least this was the case when I just tried it. Was this because uploading/downloading an email attachment does not "mimic" a direct NTFS-to-NTFS transfer or was it just an issue specific to my system or email provider?

This post was modified 1 month ago by ray m
Topic starter Posted : 07/07/2022 7:06 pm
athulin
(@athulin)
Community Legend

@ray-m "... simply sending a file via email would cause the ADS to be lost. At least this was the case when I just tried it. Was this because uploading/downloading an email attachment does not "mimic" a direct NTFS-to-NTFS transfer or was it just an issue specific to my system or email provider?"

It was probably not specific to your system. Your system, after all, allows application software to identify and access ADSes: that's built into Windows (the kernel, not the GUI Shell.)

And email providers just receive email and pass it on, as long as it is well-formatted.

But ADSes are not really intended to be user-level resources: they are more application- or system-level (the ADS added by most web browsers to downloaded files is a system-level resource), and so not necessarily of any interest to automatically include in emails. (They might contain personal information, for one.)  So it's more up to the email software you use (directly or indirectly): how does it handle files with ADSes?

Probably not at all -- as there is no support for ADS in standard mail protocols, and it doesn't know if there are any common extensions that the *receiving* email software can use, and so it is likely to be more a source of confusion and complaint than of utility. (Never a good idea to build in sources for confusion in your software). Instead, the user will have to use an ADS-capable file archive program to create a file archive file, and then send that as an attachment. Some email software may allow you to select a file archive program and provide command line options or equivalents for use when you drag files into the mail as attachments.  Of course, the recipient has to have the same software at his end.

You have a similar situation with Macs (or has it been dropped?), where files had one data fork and one resource fork.  If you wanted to transfer both those, a fork-capable archiver such as Stuffit! was required at both ends of the transfer. (See https://en.wikipedia.org/wiki/Resource_fork#Compatibility_problems)

Posted : 08/07/2022 6:33 am
ray m liked
C.R.S.
(@c-r-s)
Active Member

The meanings vary between descriptions of different technical processes, but generally a file should not be confused with its contextualized file system representations. IMHO a file is the "payload" bytestream of a file system record that is returned to an application for a read operation.

An NTFS feature allows storing multiple such payloads for a single file system record - the primary data stream and the alternate data streams. Everything around those data streams, or files, is file system metadata.

There is no need to store files in file systems, it's just convenience. You can create a PDF processor (and run it with elevated privileges to bypass the OS restrictions) that writes files to defined LBA blocks on blank disks. However, to make that repeatable, you will need to define rules for selecting the blocks, which then turn into their own file system. Using their own file systems on unpartitioned space is exactly how some malware can exfiltrate data via storage devices.

If a file system resides on a storage device (or on a partition, to which file systems restrict themselves in order to allow concurrent use) its features provide specific options to embed information more or less openly. These channels are only effective if the file system itself, on the storage device or on block-level copies, is transferred to the recipients of the information.

Just like a comprehensive functional examination of a file will uncover any techniques to hide information in the file, and the altered file will be bytewise different from the unaltered one, the same applies to the file system.

Posted : 09/07/2022 6:54 pm
ray m liked
ray m
(@ray-m)
New Member

Thank you very much, @athulin and @c-r-s!

What C.R.S. describes is very interesting and new to me. I'm not sure if I fully understand it and would like to ask you to answer some questions I have regarding that matter.

1. Creating a file system on unpartitioned space as C.R.S. describes appears to be a way to store information which is associated with a file outside of that file itself - basically as file system data. Is that correct? 

2. The creation of a PDF processor which works in the way C.R.S. describes would be necessary to initiate the process of storing that information on the unpartitioned space, right? 

3. Would it be possible that the process of saving a file which is sent as an email attachment, could include unwittingly transferring also the file system (maybe as a virtualized storage device or block-level copies) on which the additional information is stored to one's own machine?

4. If I understand you correctly, it would be possible that a PDF file (which has been altered by a PDF processor in the manner you described) could, after being sent to another device, create its own file system "around itself" like malware and store information on that file system. Or would it be necessary to have the described PDF processor installed and working on the device on which the file system is supposed to be created / transferred to?

 

Thanks in advance! You have already been of great help thus far.

Topic starter Posted : 09/07/2022 9:31 pm
C.R.S.
(@c-r-s)
Active Member
Posted by: @ray-m

1. Creating a file system on unpartitioned space as C.R.S. describes appears to be a way to store information which is associated with a file outside of that file itself - basically as file system data. Is that correct? 

It depends what you mean by "associated". You started your questions with the problem of transmitting information to different people in files. No file system feature is associated with files in this sense. Imagine how bad that would be: You created a PDF years ago under Linux on an ext2 partition and want to read it now under Windows 10 from a ReFS storage space. Fortunately, the file will not bear any traces of the file systems on which it was stored during its life, and only the reader application must be compatible.

I wanted to highlight that a file system is just a specification of how to store files on media. You can apply it to any media context: A "friendly" file system will respect the concept of partitions and be created within those, but you can create rules for how to store data outside partitions as well. A modern production file system has a ton of features that require it to store file system metadata; a simplistic and clandestine way of storing information can as well be based on rules that do not require stored file system metadata. Another option, typically used by malware, is to apply an overlay file system on top of the regular file system that is found on the machine. Then the stored data structure is not evident from accessing just the original file system. You are free to choose the storage context (what a partition is for a regular file system) when developing an overlay file system: It can be files or any file system metadata that the intended host file systems support or a combination.

Posted : 09/07/2022 10:08 pm
ray m liked
ray m
(@ray-m)
New Member

@c-r-s 

Thank you very much for the explanation! I appreciate your time and effort a lot.

I have one more question regarding file system metadata: Reading the first paragraph of your latest post I would assume that sending a PDF file to another device without making any effort of also transferring the file system the file is stored on would cause any file system metadata to be lost (or rather: remain on the device the file is being sent from and not be copied to the recipient's device). Is that correct? 

If so, this seems to conflict with something the user @tuckerhst wrote on another topic about PDF metadata (here: https://www.forensicfocus.com/forums/general/pdf-metadata-2/#post-6606330 ). In the fifth post he writes that some file system metadata is usually transferred along with the file (the topic is about the scenario of sending a PDF file via email to a different device). Do you happen to know what data he could mean by that, and is this really conflicting with what you wrote?

 

Topic starter Posted : 10/07/2022 9:02 pm
C.R.S.
(@c-r-s)
Active Member

@ray-m To be precise, name and extension of a file are file system metadata that is usually transferred with a file, e.g. by a mail client. If an application wants to create a new file, it needs to provide this information, hence it makes sense for the application to enforce certain extensions and to suggest a name to the user, e.g. the extension that matches the native file type of the application or the name that the email attachment originally had and that was transmitted in the email. As this requirement is very generic, it gives no hint as to which file system the file had been stored on.

Do you mean this sentence?

...and within a strictly controlled closed system, ensure that file transfer services maintain the integrity of the usage log metadata...

I don't think he means here that other file system metadata is "usually" transferred, as he talks about options for new file systems. At least it is not the case via email.

Think of it this way: The file system metadata is maintained by the file system and not the user application that accesses a file. It depends entirely on the permissions that the application runs with, which file system metadata it can request to read or to change. Due to purposes of file system metadata, e.g. to display when a file has been created on a file system, a user mode application does not have permissions to request changes to such data.

Of course, there are many cases where you need to replicate file system metadata and you use appropriate tools and permission settings to allow that. For example, if you migrate enterprise storage, you want to keep all the ACLs and time stamps, archive bits etc. and can use e.g. robocopy to read the metadata from the source file system and set it in the destination file system. However, this is not how files are regularly exchanged between users.
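
The difference between copying only the data and deliberately replicating metadata can be sketched in Python. This uses the standard library's `shutil` as a small stand-in, not robocopy, and it only looks at the modification time (no ACLs, archive bits or alternate streams). The file names and the timestamp are made up for the example.

```python
import os
import shutil
import tempfile


def read_bytes(path):
    with open(path, "rb") as f:
        return f.read()


def copy_demo():
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "source.pdf")
    with open(src, "wb") as f:
        f.write(b"%PDF-1.7 example bytes")
    # Give the source an old modification time (September 2001).
    old = 1_000_000_000
    os.utime(src, (old, old))

    plain = os.path.join(workdir, "plain_copy.pdf")
    kept = os.path.join(workdir, "metadata_copy.pdf")
    shutil.copyfile(src, plain)  # copies the data stream only
    shutil.copy2(src, kept)      # additionally tries to carry over timestamps

    result = {
        "same_content": read_bytes(src) == read_bytes(plain) == read_bytes(kept),
        "plain_mtime_kept": int(os.stat(plain).st_mtime) == old,
        "copy2_mtime_kept": int(os.stat(kept).st_mtime) == old,
    }
    shutil.rmtree(workdir)
    return result
```

All three files have identical content, so they hash the same; only the copy made with the metadata-aware call keeps the old time stamp.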

Posted : 11/07/2022 12:57 pm
ray m liked
ray m
(@ray-m)
New Member

@c-r-s Thanks again!

I actually meant another part of what tuckerhst wrote, but your reply nevertheless covers perfectly what interested me.

 

In the past few days I've read about something else that made me curious: so called extended file attributes. I haven't found a lot of information about that topic, and some of the things I've read seem contradictory to me. The basic idea of these EFAs seems to be that they enable associating file system metadata that's not being interpreted by the file system with files. Also they are supposed to work across file systems and operating systems.

Considering these points EFAs appear to be a way of associating a piece of information (possibly some sort of ID) with a file without altering that file's content. If they did indeed work in a cross-FS way I would expect them to be preserved when a file is being sent across devices - so this method should enable transferring information while at the same time not being detectable by simply analyzing the file.

However, from some sources I've read I got the impression that they work slightly differently (or the term "extended file attribute" might even refer to a different concept) on different operating systems - while other sources clearly gave the opposite impression. Also, there doesn't seem to be a way of displaying or editing EFAs on Windows.

Despite the idea of cross-OS and -FS compatibility I've even read somewhere that simply saving a file on a non-NTFS-storage or zipping and unzipping would cause EFAs to be deleted. I did some experimenting with Linux commands to add and alter EFAs (getfattr, setfattr) and found indeed that sending a file via email or zipping/unzipping it caused the added EFAs to be lost. Now, while that's in line with what you, @c-r-s, wrote in the second-to-last paragraph of your latest reply about metadata in general, it contradicts the idea of how specifically EFAs seem to be supposed to work.
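
For anyone who wants to repeat this kind of experiment from a script rather than with getfattr/setfattr, a rough Python sketch could look like this. It assumes Linux and a file system that accepts user attributes (Python's `os.setxattr` and `os.getxattr` are only available on Linux); the attribute name `user.note` and its value are invented for the example. Where attributes are not supported, the sketch simply reports that instead of failing.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Hash only the file's content, as a typical hashing tool would."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def xattr_demo():
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"%PDF-1.7 example bytes")
    before = sha256_of(path)
    try:
        # "user.note" is a made-up attribute name for this example.
        os.setxattr(path, "user.note", b"recipient-42")
        attribute_set = os.getxattr(path, "user.note") == b"recipient-42"
    except (OSError, AttributeError):
        # No extended attribute support on this platform or file system.
        attribute_set = False
    after = sha256_of(path)
    os.remove(path)
    return {"attribute_set": attribute_set, "hash_unchanged": before == after}
```

The content hash is the same before and after, whether or not the attribute could be stored, because the attribute lives in the file system and not in the file's data.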

Could someone please explain these extended file attributes to me?

I would be specifically interested in

1. whether the term describes the same concept across different operating systems.

2. whether they would work in the way described above to transmit information via email.

3. how one could display or edit them on Windows.

Topic starter Posted : 12/07/2022 6:38 pm
C.R.S.
(@c-r-s)
Active Member
Posted by: @ray-m

1. whether the term describes the same concept across different operating systems.

I wouldn't say so. They are "extended" in the sense that they add features which aren't required by the file system - a difficult categorization, because they still are necessarily implemented in the file system and either supported or not. Therefore, they are as specific to the file system as any other file system metadata, even though compatibility may exist between different file systems.

Compatibility means that two file systems can hold the same metadata for a file. Software for transferring files is most likely to support that compatibility if two file systems are mounted locally. If you have to go through a network protocol/file system or an application encoding, like email, you are facing their specific limitations. You can pass practically any data via any channel, if software on both ends works hand in hand to do so. But a regular email client is designed to encode only the file data stream and attach it together with information on the original file name and extension.
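
What an email attachment carries can be sketched with Python's standard `email` package. This is only an illustration of the encoding step, not of any particular mail client; the subject, file name and content are made up.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def attachment_roundtrip(filename, data):
    """Attach the data, serialise the mail, parse it again, return what arrives."""
    msg = EmailMessage()
    msg["Subject"] = "brochure"
    msg.set_content("See the attached file.")
    # Only the bytes, a MIME type and a file name go into the message.
    msg.add_attachment(data, maintype="application", subtype="pdf",
                       filename=filename)
    raw = msg.as_bytes()

    received = message_from_bytes(raw, policy=policy.default)
    part = next(received.iter_attachments())
    return part.get_filename(), part.get_content()


name, content = attachment_roundtrip("brochure.pdf", b"%PDF-1.7 example bytes")
```

On the receiving side there is a name and a byte stream, nothing more. Time stamps, alternate data streams or extended attributes of the original file were never put into the message, so there is nothing for the recipient's software to restore.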

A good example of EFAs is the zone identifier ADS on NTFS that is interpreted by Windows Explorer or Office. I consider that a niche application with a purely local function. Wikipedia lists several other purposes of EFAs, which don't seem practical or commonplace to me:
Storing an author? Relevant file types either can store this internal metadata, or, if you create a DMS that needs additional data fields, it is preferable to use a file system-agnostic overlay, e.g. stored in a database, in order to support multiple platforms.
Are there file systems that store checksums for application use? I don't know. But if we are talking about ZFS or ReFS integrity streams, they are not "extended" but interpreted by the file system, because you don't want the file system to return incorrect data in general and not leave the checks to each individual application.

Posted : 18/07/2022 4:53 pm
ray m liked
ray m
(@ray-m)
New Member

@c-r-s 

Thank you! Your replies have been very helpful for me.

Topic starter Posted : 20/07/2022 11:37 am