Hello everyone,
I have a question regarding metadata of PDF files: do copies of PDF files or their metadata contain any information that could theoretically be used to directly identify a specific user or device who/that used and copied the original PDF file (or the preceding copy that was used to create the available copy)?

To clarify my question I'll give an example:
Person A creates a PDF file and uploads it to a website.
Person B downloads the file and emails it to Person C.
Person C sends the file to Person D.
Person D analyses the file to find out who else had access to the document. Would/could the (copy of the) file that Person D received contain any information/metadata that could prove Person B accessed and distributed the file that was later used to create D's copy of it? Do pieces of information/metadata exist that would include B's username, computer name / device ID, MAC address or anything else relevant for the matter?

Thanks in advance
ray
No. Under the most common definition of "the file", Person D's version is hash-value identical to the one Person A created. If the two files are the same file, then no data has been added or subtracted.
One can imagine a workflow in which tracking data is added to file properties, if the software was designed to do that. Such a system is certainly feasible. However, it would complicate identifying Person D's version as the same as Person A's version, since the file contents (internal PDF metadata) would have changed.
In the 90s, Lotus Notes was designed to manage a workflow, and it did exactly what I've described, by modifying fields in 1-2-3 spreadsheet files, for example, as they passed from person to person. But that's not what you described.
So, no.
@tuckerhst Thanks for the reply!
I have a follow-up question: If I understand you correctly, you are referring to pieces of metadata that are part of the file itself and would therefore alter the file's hash value when they are edited. Is that correct?
But what about those pieces of metadata that are merely attached to the file and therefore wouldn't impact the hash value? I wonder whether such metadata could contain the information I described above.
I imagine a scenario in which the PDF's metadata would contain a column named, for example, "derived from [MAC address]" or "Last accessed/edited by [device ID]" which would update itself just like the "last edited [date]" property without altering the hash value.
Would this be possible using DocumentInfo or XMP metadata, or would one need to try the approach you described above, which would necessarily alter the hash value?
I imagine a scenario in which the PDF's metadata would contain a column named, for example, "derived from [MAC address]" or "Last accessed/edited by [device ID]" which would update itself just like the "last edited [date]" property without altering the hash value.
I meant: just like the "last accessed [date]" property.
The way file hashing works is a consequence of the way commonly used file systems and operating systems are designed. The data stream of the file is what's hashed, not its file system metadata (some of which is generally transferred along with it). So you can change a file's name, extension, and MAC (modified, accessed, created) timestamps and it still hashes the same.
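A minimal Python sketch of that point, assuming SHA-256 as the hash and a throwaway placeholder file (the file names and content are made up for illustration): renaming the file and resetting its timestamps leaves the hash untouched, because only the data stream is read.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Hash only the file's data stream, as common hashing tools do."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def hash_before_and_after_metadata_change():
    """Return the hash before and after changing name and timestamps."""
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, "report.pdf")
        with open(original, "wb") as handle:
            # Placeholder bytes, not a real PDF.
            handle.write(b"%PDF-1.4 placeholder content\n")
        before = sha256_of(original)

        # Change name and extension: file system metadata only.
        renamed = os.path.join(folder, "copy.bin")
        os.rename(original, renamed)
        # Set the accessed and modified times to an arbitrary past moment.
        os.utime(renamed, (1_000_000_000, 1_000_000_000))

        after = sha256_of(renamed)
    return before, after


before, after = hash_before_and_after_metadata_change()
print(before == after)  # True
```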
PDF metadata is stored internally to the file. It changes the hash value. Of course you could change the designs and make them work differently. Within new file systems, you could extend file system metadata to include a usage log of the file, and within a strictly controlled closed system, ensure that file transfer services maintain the integrity of the usage log metadata, which would allow the file data itself to remain unchanged.
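By contrast, here is a small sketch of why internal metadata changes the hash. The bytes below only resemble a PDF document information dictionary; they are an illustrative fragment, not a complete or valid PDF, and the helper names are hypothetical.

```python
import hashlib


def build_fragment(author: bytes) -> bytes:
    """Build bytes resembling a PDF Info dictionary (not a valid PDF)."""
    return (
        b"%PDF-1.4\n1 0 obj\n<< /Title (Report) /Author ("
        + author
        + b") >>\nendobj\n"
    )


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


original = build_fragment(b"Person A")
edited = build_fragment(b"Person B")

# The author field is part of the data stream, so editing it changes the hash.
print(sha256_hex(original) == sha256_hex(edited))  # False
```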
Good luck!
Hi,
I was just scrolling and found this informative topic. Thanks for this valuable information.
Thank you very much!
Your information was very helpful for me.

So the take-home message is that extracting some kind of user log info from a PDF file or its metadata would require special effort of altering that file beforehand, especially if copies of that file were to have the same hash value as the original.

If anyone has any further thoughts or ideas on this topic I would be very interested in hearing them, but my initial question has been perfectly answered.

Best regards
ray
Hello everyone,

I recently found an interesting article (this one: https://www.sciencedirect.com/science/article/pii/S1742287619300234) that made me wonder whether the procedures described in it could be used for the purpose of accessing some kind of usage log of a PDF file in the manner described in my first post. I would be very interested in hearing your opinions about it - especially yours, @TuckerHST.

First I'd like to summarize the parts of the article which are relevant to me: the authors focus on an approach to reconstructing user activity on NTFS file systems which uses metadata stored in the system file $ObjId. They describe that this file includes an index of ObjectIDs, which are created when the user of the device interacts with files. What's interesting to me is that some of those ObjectIDs are created when the user merely accesses the file and that they can in fact contain the MAC address of the device used for accessing them.
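To illustrate how an identifier can carry a MAC address at all, here is a sketch that assumes the ObjectID is a version 1 (time-based) UUID, whose last 48 bits are the "node" field. The sample ID is generated locally with a made-up node value, and mac_from_object_id is a hypothetical helper, not something taken from the article.

```python
import uuid


def mac_from_object_id(object_id: uuid.UUID):
    """Return the node field as a MAC-style string, or None if unavailable."""
    if object_id.version != 1:
        return None  # only time-based UUIDs embed a node value
    node = object_id.node
    # By convention (RFC 4122), a set multicast bit marks a random node
    # value rather than a real network card address.
    if (node >> 40) & 0x01:
        return None
    return ":".join(f"{(node >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


# Made-up node value for illustration only.
sample = uuid.uuid1(node=0x001122334455)

# Windows-style GUIDs are commonly stored in a mixed-endian 16-byte layout,
# which Python exposes as bytes_le (assumed here to match the on-disk form).
raw = sample.bytes_le
parsed = uuid.UUID(bytes_le=raw)

print(mac_from_object_id(parsed))  # 00:11:22:33:44:55
```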

I am aware that the authors focus on the possibility of analyzing activity that occurred on specific devices: if one possesses a thumb drive, one might analyze the $ObjId file of said device to reconstruct previous user activity involving the files saved on the device. I, on the other hand, am interested in the possibility of extracting some kind of usage log from PDF files that have been sent/copied across devices. What made me excited reading the article was what @TuckerHST wrote in his previous post: that some file system metadata is transferred along with the file. I would like to know whether this transferred-along metadata includes the ObjectIDs.

Extending the example given in my first post with what the authors described, I am interested in the following case:
Person A creates a PDF file and uploads it to a website.
Person B downloads the file and accesses it on an NTFS file system. This creates an ObjectID containing Person B's MAC address. Person B sends the PDF file to Person C.
Person C sends the file to Person D.
Would the ObjectID or any part of the $ObjId file have been transferred along with the file, so that Person D could find out Person B's MAC address by analyzing the file, or would any such information be stored solely on the volume that was used? Since @TuckerHST already wrote that file system metadata would need to be extended in order to include any kind of usage log, I suspect the latter to be true. But because the concept of ObjectIDs is new and very interesting to me, I just have to ask specifically.

Thanks in advance
ray
I think you are heading the wrong way. PDF files are ... just files. There's no 'environment' that would force creation of the kind of metadata you seem to be looking for.
The closest example I have encountered was a document management system (DMS) used by an attorney's office. Every document they received was checked into this DMS, and this caused a metadata entry to be added to the document. Updating the document changed the metadata, and finalizing the document or sending it to a recipient did similar things. (This was used to show that the abstract document had existed before its current file creation time stamp ...)
But note that this was inside a document management system. Outside such a system, it depends on whether and how the particular systems involved manage this metadata. The closest thing I've seen is the Windows alternate data stream that gets added on download. I am not aware of any web upload or download, or email functionality, that actually tries to add or modify metadata about its operation.
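For reference, that download stream is named Zone.Identifier and holds a few INI-style lines. The sketch below parses a made-up sample of such content; the URL is a placeholder, and parse_zone_identifier is a hypothetical helper. On Windows with NTFS, the real text could be read by opening the path with ":Zone.Identifier" appended to the file name.

```python
import configparser


def parse_zone_identifier(text: str) -> dict:
    """Parse the INI-style content of a Zone.Identifier stream."""
    # Disable interpolation so '%' characters in URLs are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original key capitalization
    parser.read_string(text)
    if not parser.has_section("ZoneTransfer"):
        return {}
    return dict(parser["ZoneTransfer"])


# Made-up sample content; ZoneId 3 denotes the Internet zone.
sample = (
    "[ZoneTransfer]\n"
    "ZoneId=3\n"
    "HostUrl=https://example.com/files/report.pdf\n"
)

info = parse_zone_identifier(sample)
print(info["ZoneId"])  # 3
```

As far as I know, this stream is stored by the file system next to the file's main data, so it describes the local download and is generally lost when the file is emailed or uploaded.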
If you want to research such systems for artifacts, I suggest you focus on the non-OS components you mention: web upload, download, mail, etc. Standard operating system platforms are of such general interest that anything like this is likely to have been observed and documented a long time ago.
The PDF file format is documented, and the specification is available: either as an ISO standard, or possibly for free from Adobe (it's a long time since I tried to download it). That is, you can use that specification to make sure you can access all parts of a PDF document (unless encrypted), and see if there is any 'foreign' information included.
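As a rough first pass before reading the specification in depth, one could scan the raw bytes for plainly visible metadata. This is only a sketch: it assumes unencrypted data, literal (parenthesized) strings in the document information dictionary, and an uncompressed XMP packet, so it will miss anything stored in compressed or encoded form. The sample bytes are made up and are not a valid PDF.

```python
import re

# Literal-string entries such as /Author (Person A); hex strings are ignored.
INFO_STRING = re.compile(rb"/(Author|Creator|Producer|Title)\s*\(([^)]*)\)")
# XMP packets are wrapped in xpacket processing instructions.
XMP_PACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def visible_metadata(data: bytes):
    """Return plainly visible Info entries and XMP packets from raw bytes."""
    info = {
        key.decode("ascii"): value.decode("latin-1")
        for key, value in INFO_STRING.findall(data)
    }
    packets = [match.group(0) for match in XMP_PACKET.finditer(data)]
    return info, packets


# Made-up fragment for illustration.
sample = (
    b"1 0 obj\n<< /Author (Person A) /Producer (Example Writer) >>\nendobj\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"
    b"<?xpacket end='w'?>\n"
)

info, packets = visible_metadata(sample)
print(info["Author"], len(packets))  # Person A 1
```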
However, several years ago such metadata often ended up in the news, and most major software developers made sure to provide methods for avoiding that kind of information leakage. For that reason, I'd consider it somewhat unlikely to have reappeared in something as well-established as PDF files. Rather the opposite: today, people are making money off finding security vulnerabilities (including information security vulnerabilities), and would be rather likely to discover and report such leakages as you have in mind comparatively quickly.

@athulin
Thank you very much!