by Harold Burt-Gerrans
Let’s continue from where we left off last time, discussing standardization. If you missed it, Part 1 was all about establishing standards. Now a bit about following standards. This will sound funny to those who know what a rebel I tend to be! Watch out, I’m about to rant…
When there are established standards, they should be followed. I mention this because one particular issue has caused some grief on a couple of recent occasions, and I’ve discovered that a few industry-leading processing software applications have adopted this particular deviation from the standard. As the reference supporting my arguments, I’m using the Library of Congress.
For those of you who are non-technical, an EML file is a text file containing an email message and its attachments. An MBOX file is a file containing one or more EML messages, each preceded by a separator line. MBOX is the format for messages exported by applications such as Gmail, when exported from their Vault and Personal Data Backup functions. Yes, I am aware that Gmail also has functions to export as PST, but not every item in Gmail converts to MSG format properly, and folder-structured PSTs are structurally different from the multi-labelling system used in Gmail. For eDiscovery purposes, it is safer to export as MBOX: there is less chance that the smoking gun message fails conversion and is not exported.
For all the techies, an EML file is RFC-5322 formatted data representing a single message (note that RFC-5322 replaces RFC-2822, which replaced RFC-822), and an MBOX file is a collection of RFC-5322 messages, each preceded by a specifically formatted separator line (see RFC-4155).
During eDiscovery processing, it is not acceptable to break an MBOX file into individual files simply by putting each separator line and its following message into an individual file, and then calling those files “somefilename.EML”. In this form, the contents of these individual files are still RFC-4155 and not truly RFC-5322, and hence are NOT valid EML files. A valid EML file requires the removal/restructuring of the separator line.
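As a rough illustration of that splitting step, here is a minimal Python sketch. It assumes the MBOX content has already been decoded to a string, that every separator line starts with "From " and either opens the file or follows a blank line, and it ignores quoting variants such as ">From ". The function name split_mbox_to_eml is hypothetical and not taken from any processing tool.

```python
def split_mbox_to_eml(mbox_text):
    """Split mbox text into message strings, dropping each separator line."""
    messages = []
    current = None        # lines of the message currently being collected
    prev_blank = True     # the start of the file counts as "after a blank line"
    for line in mbox_text.splitlines(keepends=True):
        if line.startswith("From ") and prev_blank:
            # A separator line: close the previous message, start a new one.
            if current is not None:
                messages.append("".join(current))
            current = []  # the separator line itself is NOT copied
        elif current is not None:
            current.append(line)
        prev_blank = line.strip() == ""
    if current is not None:
        messages.append("".join(current))
    return messages
```

Each returned string begins with a real header field rather than the separator, so it could be written out as its own .EML file. A production tool would also need to handle encodings and escaped "From " lines inside message bodies, which this sketch does not attempt.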
Here’s an example of what I am complaining about: two files, each valid against the specification it references (RFC-5322 for the first, RFC-4155 for the second).
Even though the difference between them is very small, the second file cannot be considered a valid EML.
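That small difference can be checked mechanically. The Python sketch below uses made-up sample content (not taken from any real mailbox) and a hypothetical helper, first_line_is_header. It only tests whether the first line looks like an RFC-5322 header field, meaning a name without spaces followed by a colon, so it is a quick sanity check and not a full validator.

```python
import re

# RFC 5322 field names use printable ASCII characters except space and colon.
HEADER_FIELD = re.compile(r"^[!-9;-~]+:")

def first_line_is_header(text):
    """Return True if the first line looks like an RFC 5322 header field."""
    first_line = text.split("\n", 1)[0]
    return bool(HEADER_FIELD.match(first_line))

# Hypothetical sample content for illustration only.
eml_sample = "From: alice@example.com\nSubject: Hello\n\nBody\n"
mbox_sample = "From alice@example.com Sat Jan  1 00:00:00 2022\n" + eml_sample

print(first_line_is_header(eml_sample))   # True
print(first_line_is_header(mbox_sample))  # False
```

The separator line fails because "From" is followed by a space instead of a colon, which is exactly what distinguishes it from the "From:" header field.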
One really good reason that this EML standard should be followed: Relativity’s document viewer does not display invalid EML files.
Another good reason: It’s a defined standard… comply with it!
I’ve heard two counter-arguments from “processors” who believe it is acceptable to call these RFC-4155 (MBOX) files EMLs:
a) Many applications, such as Outlook, will open these files;
b) The separator line must be maintained because it is metadata that should be kept.
Here’s my opinion on both arguments: NONSENSE.
a) Just because an application (or several), regardless of popularity, is smart enough to compensate for your lack of ability to follow a standard does not imply that you are doing it right. It’s not a grey area. You’re either following the standard and are valid, or you’re not. Don’t assume that every piece of software will be so forgiving.
b) The separator line is typically made of information contained within the other header fields of the RFC-5322 data. Consequently, it does not provide any metadata that is not available elsewhere. Hence, it can be removed. If you still insist on keeping it, then modify it to be a valid RFC-5322 header line by prefixing it with something like “X-RFC4155: “. In the RFC-4155 (MBOX) example above, it would be a valid EML if the first line were removed or converted into such an “X-RFC4155: “ header line.
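A minimal Python sketch of that prefixing step follows, using hypothetical sample content. The function name separator_to_header is an assumption, and the header name is simply the one suggested above.

```python
def separator_to_header(message_text, header_name="X-RFC4155"):
    """Turn a leading mbox separator line into a header field by prefixing it."""
    if message_text.startswith("From "):
        # "From addr date" becomes "X-RFC4155: From addr date".
        return f"{header_name}: {message_text}"
    # No separator line present: leave the message untouched.
    return message_text

# Hypothetical sample content for illustration only.
fragment = (
    "From alice@example.com Sat Jan  1 00:00:00 2022\n"
    "From: alice@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body\n"
)
print(separator_to_header(fragment).split("\n", 1)[0])
# X-RFC4155: From alice@example.com Sat Jan  1 00:00:00 2022
```

The separator text is preserved word for word, but it now sits in a header field that RFC-5322 parsers can skip or read like any other extension header.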
If you know that you are one of these processing applications that doesn’t follow the standard, please add this correction to your list of future bug fixes. Enough ranting… on to something new.
De-duplication level during document review
I don’t believe there should be any level of de-duplication other than “Global” during a document review. “None” and “Custodial” are options that should not even be presented. That said, fields like “All Custodians” and/or “All Sources” should be available from your eDiscovery processing software to indicate where multiple copies have occurred. At the end of the review, if there is some distinct metadata item (such as “Not Read” in the case of an email) that is needed for a specific custodian’s version of a document, it should be made available from the processing software when needed, typically at production time. Adding dozens of duplicate documents to the review when most of them will be insignificant just causes more work for the review team, leaves more room for coding inconsistencies, and wastes (often billable) disk space.
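To make the idea concrete, here is a small Python sketch of global de-duplication that keeps one copy of each document and records every custodian who held it. It assumes documents are plain dictionaries with hypothetical “content” and “custodian” keys, and that duplicates are detected by hashing the raw bytes. Real processing software typically hashes selected email fields instead, so treat this as an illustration only.

```python
import hashlib

def global_dedupe(documents):
    """Keep one copy per content hash and record every custodian that held it."""
    unique = {}
    for doc in documents:
        digest = hashlib.sha256(doc["content"]).hexdigest()
        if digest not in unique:
            unique[digest] = {"content": doc["content"], "all_custodians": []}
        custodians = unique[digest]["all_custodians"]
        if doc["custodian"] not in custodians:
            custodians.append(doc["custodian"])
    return list(unique.values())

# Hypothetical input: the same memo collected from two custodians.
collected = [
    {"custodian": "Alice", "content": b"Quarterly memo"},
    {"custodian": "Bob", "content": b"Quarterly memo"},
    {"custodian": "Bob", "content": b"Lunch plans"},
]
for item in global_dedupe(collected):
    print(item["all_custodians"], item["content"])
```

Reviewers would see two documents instead of three, and the “all_custodians” list still shows that both Alice and Bob held the memo.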
But what if one copy is privileged and de-duping might remove the non-privileged copy, or worse, remove the privileged copy allowing the accidental production of privileged information? If you have two copies of a document, the document itself cannot be both privileged and not privileged based on its own content. A law firm that has significant experience as Amici Curiae for Privilege Reviews once told me that “A document can, on its own merit, be considered privileged or not, and that due to various family relations, privilege can be lost or gained.” Hence, differences in privilege for these copies can only be the result of their associated families, as in one is an attachment to a privileged email and the other is not. In this case, however, the emails, as separate individual parent documents, will define privilege for the family.
Hey, you’ve made it to the end of part 2. I hope this was a little more exciting than Part 1. Part 3 will be more thought-provoking (at least to techies like me) as we’ll start discussing future data structures to enhance the eDiscovery experience. eDiscovery Utopia, here we come….
About The Author
Harold Burt-Gerrans is Director, eDiscovery and Computer Forensics for Epiq Canada. He has over 15 years’ experience in the industry, mostly with H&A eDiscovery (acquired by Epiq, April 1, 2019).