by Harold Burt-Gerrans
I’m approaching this multi-part article from a software development point of view, because I believe many of the following issues arose as eDiscovery software evolved from the procedures used for handling boxes of paper. My long-held opinion is that eDiscovery software providers have engineered new features into legacy data structures that modeled scanning environments, without stepping back and recognizing that those structures need an overhaul to adapt to the ever-changing ESI world.
Much of what I will discuss revolves around the handling of duplicative data, with deviations into the future of eDiscovery review platform structures and processes. This initial set of points just gets us rolling with some of the basics, so Part 1 is a little boring but I’ll build from there. And occasionally, I might rant about an issue or two…
Many years ago, I worked in the automotive industry. My team was approached by one of the car manufacturers to develop an application that would allow their suppliers to receive electronic purchase orders. At the time, the manufacturer believed that completely switching to Electronic Data Interchange for manufacturing P.O.s would save $200/car, and at a rate of one car per minute per production line, the savings were well over $250K per day in Oshawa alone. We built that application somewhat customized to the initial manufacturer’s specifications.
Once we had it though, we were approached by the other manufacturers to provide the application to their suppliers – often there was overlap in that some suppliers worked with multiple car manufacturers. So, we modified the application to strictly follow ANSI and EDIFACT standards and satisfied all the car manufacturers. But, because we built in customizable data entry screens, communications protocols and report generation, over a few short years we sold thousands of copies to a variety of industries, including:
- several hundred copies to a Canadian telecom company that implemented the customizable data entry screens to take client service requests; and
- several thousand copies to a major American beer company that provided a free workstation and software to bars on the theory that making it easier for bars to order their products would increase their overall sales. I think it worked, as they still reign over the American beer industry.
Why were we successful in so many industry verticals? One single reason – standardization. Maybe our dedication to broad functionality and extreme user friendliness helped, but without standardization, we couldn’t have entered so many markets.
Standardization is the area that is not up to standard (no pun intended) within the eDiscovery industry. At this point, I only have suggestions that I hope eDiscovery providers might adopt. I’m sure it will take an industry committee consisting of eDiscovery software providers and legal consultants to work out a set of rules – maybe a more granular technical version of Sedona. My suggested standards should address at least the following two issues (with consideration for the additional concepts to be discussed in future parts of this article):
Standardized de-duplication routines
Recently, we worked on a project where one set of data was provided to us and a similar set to the opposing party. Our client had already loaded the data into the review platform to do their own review, so when the opposition provided productions of several tens of thousands of documents, we needed to link each document we had processed to the matching document that the other party had produced.
The biggest stumbling block to this matching was that the opposing party used different processing software than we did, and consequently, none of the hash calculations between the two data sets could be used to match parent and/or child documents. Unfortunately, the source path information from the opposing party could not be used to match back to our source path information either (and it didn’t help that the opposing party did a poor de-duplication overall). Consequently, we needed to re-process thousands of files, at the expense of our client, from the opposing party’s production just to get metadata and hash calculations that could be used to electronically match the opposition productions to the review case documents. Would it have been different if we had used the same processing software? Probably not, because there are variable options within processing tools (that I have seen) that control which metadata is used for calculating de-duplication hashes of emails.
What needs to be in place is a set of rules that the industry follows when calculating de-duplication hashes, so that the hashes are:
- Consistent between processing tools so that the same result is generated regardless of the tool. This should include controls so that the de-duplication strings are built using consistent data fields and adjust for variations in email addresses, display names, title, message body text, message formats (msg/eml) etc.;
- Calculated on each level down through a family so that a child of one parent can be matched to the grandchild of another (or a stand-alone parent can be matched to a child of another parent);
- Consistent with the treatment of containers (i.e. does a zip file count as a parent to its contents when the zip file is a child of another parent? Personally, I say always ignore them as individual items and put their contents as children of the zip’s parent, but include the zip file names as part of the source path structures); and
- Capable of handling variations in children (i.e. two copies of the same email may exist but one has updated children – hence the two emails may be considered duplicates but their families are different and should be treated as such).
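As a rough illustration of the first, second and fourth points, the sketch below hashes each item from normalized fields, hashes a whole family separately, and indexes every level of a family. It is only a sketch under stated assumptions: the `Item` record, the choice of fields, the normalization rules and the use of SHA-256 are all hypothetical and are not taken from any processing tool or existing standard.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical record: an email, an attachment, or a loose file."""
    subject: str = ""
    sender: str = ""
    recipients: list = field(default_factory=list)
    body: str = ""
    children: list = field(default_factory=list)


def normalize_address(raw):
    """Reduce '"Jane" <Jane@Example.com>' to 'jane@example.com' (assumed rule)."""
    match = re.search(r"<([^>]+)>", raw)
    return (match.group(1) if match else raw).strip().lower()


def normalize_text(raw):
    """Collapse whitespace runs so line-ending differences do not change the hash."""
    return " ".join(raw.split())


def item_hash(item):
    """Hash one item on its own, ignoring its position in a family."""
    parts = [
        normalize_text(item.subject),
        normalize_address(item.sender),
        ";".join(sorted(normalize_address(r) for r in item.recipients)),
        normalize_text(item.body),
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def family_hash(item):
    """Hash an item together with all descendants; changed children change it."""
    child_hashes = sorted(family_hash(child) for child in item.children)
    combined = item_hash(item) + "|" + "|".join(child_hashes)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


def index_all_levels(root, index=None):
    """Map item_hash -> items at every depth, so any level can match any other."""
    index = {} if index is None else index
    index.setdefault(item_hash(root), []).append(root)
    for child in root.children:
        index_all_levels(child, index)
    return index
```

With two hashes per item, two copies of an email can agree on `item_hash` while differing on `family_hash` when one copy has updated children, and a stand-alone file can be looked up in another family's index regardless of how deeply its twin is nested.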
It is likely that no single standard can be used across the board, but the options used in processing could be standardized and included in the production information (e.g. whether “BCC” was used as a metadata field for de-duplication). Ideally, the initial meetings between parties to decide document exchange protocols could also determine which standardized processing options are followed for de-duplication.
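One way to picture disclosed processing options is a small, machine-readable record that travels with the production and can be compared against the recipient's own settings. The sketch below is purely illustrative: `DedupOptions` and every field name in it are hypothetical, not options exposed by any particular processing tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DedupOptions:
    """Hypothetical de-duplication settings to disclose with a production."""
    hash_algorithm: str = "SHA-256"
    include_bcc: bool = True
    include_attachment_names: bool = True
    normalize_addresses: bool = True


def to_manifest(options):
    """Serialize the options as stable JSON that can accompany a production."""
    return json.dumps(asdict(options), sort_keys=True, indent=2)


def differences(ours, theirs):
    """Return {option: (our value, their value)} for each setting that differs."""
    a, b = asdict(ours), asdict(theirs)
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}
```

If the two parties' records differ only on `include_bcc`, `differences` reports exactly that one setting, which tells the recipient up front why email hashes will not line up.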
Standardized interchange architectures
This will be short – perhaps it’s more of a complaint than a suggestion. It would be nice if there were a set of rules that everyone followed on productions so that when the other side receives a production, it is complete and error-free. Too often we spend billable time reviewing and adjusting other parties’ productions to get them into a format that will load into our review platforms. In some cases, the cause may be a limitation of the review platform itself (Hey Relativity, what’s wrong with grey scale TIFFs?). Overall, a set of rules governing productions (with flexibility for the small percentage of cases that need special handling), enforced by the processing/review/production software, would save recipients hours of time loading the documents into their chosen platform.
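To make "enforced by the software" a little more concrete, here is a minimal sketch of a check that production software could run before a production leaves the building. The required field names and the two rules (no blank required fields, no repeated control numbers) are assumptions chosen for illustration, not part of any agreed standard.

```python
# Assumed field names for one row of a production load file.
REQUIRED_FIELDS = ("control_number", "custodian", "image_path")


def validate_production(records):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    seen = set()
    for row, record in enumerate(records, start=1):
        # Rule 1: every required field must be present and non-blank.
        for name in REQUIRED_FIELDS:
            if not str(record.get(name, "")).strip():
                problems.append(f"row {row}: missing {name}")
        # Rule 2: a control number may appear only once in the production.
        control = record.get("control_number")
        if control:
            if control in seen:
                problems.append(f"row {row}: duplicate control_number {control}")
            seen.add(control)
    return problems
```

A producing party would fix every reported problem before delivery, so the recipient is not the one spending billable time finding them.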
Hey, you’ve made it to the end of part 1. I hope I didn’t bore you too much. I promise the next part will be a bit more entertaining. eDiscovery Utopia, here we come….
About The Author
Harold Burt-Gerrans is Director, eDiscovery and Computer Forensics for Epiq Canada. He has over 15 years’ experience in the industry, mostly with H&A eDiscovery (acquired by Epiq, April 1, 2019).