Machine Learning For Document Review: The Numbers Don’t Lie
By James D. Shook, Esq.
In light of Magistrate Judge Andrew Peck’s recent decision in Da Silva Moore v. Publicis, much has been written and discussed about the idea of using machine learning techniques to automatically classify documents during review, a process sometimes known as “predictive coding” or even “computer assisted review.” (Although these terms may actually imply different technologies and processes, this article adopts Judge Peck’s umbrella use of the term “predictive coding.”) This article explores some of the key issues around this promising intersection of law and technology.
What Is Predictive Coding? How Is It Used?
At a simple level, predictive coding is just a technological “lever” that allows a (relatively) small amount of review work, usually by humans, to be leveraged across a much larger set of documents. Let’s say that we have a class action where the identification phase of eDiscovery has located about twenty million items of electronically stored information (ESI) – email messages, word processing files, spreadsheets, PowerPoint presentations, etc. – that are likely to be relevant, discoverable information for the issues in our case.
Traditionally, we have had a few choices about what to do with all of this ESI. First, we could just hand it all over to the other party without reviewing any of it for actual relevance or even privilege. Second, we could negotiate search terms with the other side, which we would then run against the ESI in an attempt to locate the most relevant information. All non-privileged ESI that was a “hit” with the search terms would be handed over to the other side. Third, we could have human reviewers read each item to determine what is relevant (and non-privileged), and then produce that information. Each of these approaches has its benefits and problems, including the amount of time that the process takes, the cost of reviewing millions of items, the validity and usefulness of the process, etc.
With predictive coding, we have another choice to help determine what information is relevant. In this approach, we carefully review a small subset of the 20,000,000 item set – maybe as few as 2,500 items depending upon the margin of error that we can tolerate in our result. We then use predictive coding technology to “learn” from that set, applying that “knowledge” to the remaining 19,997,500 items.
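To make the link between sample size and margin of error concrete, here is a minimal Python sketch using the standard normal-approximation formula for a proportion estimated from a simple random sample. The 95% confidence level (z = 1.96), the worst-case proportion of 0.5, and the function name are assumptions chosen for illustration; the article does not say how the 2,500 figure was derived, and a real protocol may use a different statistical method.

```python
import math


def margin_of_error(sample_size, z=1.96, proportion=0.5):
    """Approximate margin of error for a proportion estimated from a
    simple random sample (normal approximation).

    z=1.96 corresponds to a 95% confidence level, and proportion=0.5 is
    the worst case; both are illustrative assumptions.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# A sample of 2,500 items gives roughly a 2 percentage point margin.
print(f"{margin_of_error(2500):.2%}")
```

Under these assumptions the margin depends on the sample size rather than on the size of the full collection, which is why a few thousand items can stand in for millions.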
Is Predictive Coding Worth the Trouble?
The effective use of predictive coding impacts the litigation process by substantially reducing the cost and time for review. In our example, in applying a predictive coding process we might manually review just 2,500 documents of the 20,000,000 set, depending on the specific technology, case requirements and tolerance for potential errors. If we assume a cost of $5 per document for high-level manual review (by a more experienced attorney) and just $0.50 for bulk review (by contract attorneys and paralegals), we would spend about $12,500 on the high-level review of those 2,500 documents (plus the costs of the predictive coding technology process). This cost is in stark contrast to a complete manual review, which even at the bulk rate could cost up to $10,000,000! (Note that in Da Silva, there were about 3.2 million documents. At one point the parties estimated a cost of $5 per document to produce a projected set of 40,000 documents. Hearing Transcript at 62; Order at 6).
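The cost comparison can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the example: the per-document rates and document counts come from the article, while the variable and function names are hypothetical and the cost of the predictive coding technology itself is left out.

```python
TOTAL_DOCUMENTS = 20_000_000   # size of the example collection
SEED_SET_SIZE = 2_500          # documents reviewed manually for training
EXPERT_RATE = 5.00             # dollars per document, high-level review
BULK_RATE = 0.50               # dollars per document, bulk review


def review_cost(documents, rate_per_document):
    """Cost of manually reviewing `documents` items at a flat rate."""
    return documents * rate_per_document


seed_cost = review_cost(SEED_SET_SIZE, EXPERT_RATE)        # 12,500
full_cost = review_cost(TOTAL_DOCUMENTS, BULK_RATE)        # 10,000,000

print(f"Seed set review:    ${seed_cost:,.0f}")
print(f"Full manual review: ${full_cost:,.0f}")
print(f"Ratio:              {full_cost / seed_cost:,.0f}x")
```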
The predictive coding process takes less time because the computers handling the review are much faster, more consistent and can work longer hours than people. If we assume a review rate of two documents per minute and an eight-hour working day, then the manual phase of our predictive coding process requires about 2.6 people-days (the machine-based review would come after this phase). In contrast, a complete, eyes-on review process would take 20,833 people-days! Even with a large team of 100 reviewers, that process would take over 200 working days and require strong project management to complete properly.
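The time estimates follow from the same kind of arithmetic. In the sketch below, the review rate of two documents per minute is the article's; the eight-hour working day is an assumption (it is the value that reproduces the 20,833 figure), and the function name is hypothetical.

```python
def people_days(documents, docs_per_minute=2, hours_per_day=8):
    """Reviewer-days needed to read `documents` items.

    hours_per_day=8 is an assumption, not a figure stated in the article.
    """
    minutes = documents / docs_per_minute
    return minutes / 60 / hours_per_day


seed_days = people_days(2_500)         # manual phase of predictive coding
full_days = people_days(20_000_000)    # complete eyes-on review
team_days = full_days / 100            # calendar days for 100 reviewers

print(f"Seed set:      {seed_days:.1f} people-days")
print(f"Full review:   {full_days:,.0f} people-days")
print(f"100 reviewers: {team_days:.0f} working days")
```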
Perhaps more important, according to recent studies the predictive coding process is also more effective than human or keyword review. Unfortunately, it is difficult to determine the true accuracy of human review because opinions, even among experts, can vary on whether a document is relevant to a case. (Maura Grossman & Gordon Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII Rich. J.L. & Tech. 11 (2011) at 9). But the bulk of available information implies that machine coding is better. In fact, some studies put a human reviewer’s recall (the percentage of relevant documents actually located) at less than 50%. The use of basic keywords is even worse, dropping recall to about 25%. (Grossman/Cormack at 18-19). Some predictive coding studies indicate that the process is far more accurate, in the range of 70% recall (Grossman/Cormack at 36-37). Given the lower cost and greater speed, recall that’s even close to the human level should be acceptable. (Note that other measures, such as precision and F1, the harmonic mean of recall and precision, are also important in this process. For more information see Grossman/Cormack at 9).
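For readers who want to see how these measures relate, here is a short Python sketch of recall, precision and F1 computed from review counts. The counts used in the example are hypothetical and chosen only to produce a recall of 70%; they are not taken from Grossman/Cormack or any other study.

```python
def recall(true_positives, false_negatives):
    """Share of all relevant documents that the review actually found."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Share of documents marked relevant that really are relevant."""
    return true_positives / (true_positives + false_positives)


def f1_score(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    return (2 * precision_value * recall_value
            / (precision_value + recall_value))


# Hypothetical review: 1,000 relevant documents exist; the process finds
# 700 of them, misses 300, and wrongly marks 175 irrelevant ones.
r = recall(true_positives=700, false_negatives=300)
p = precision(true_positives=700, false_positives=175)
print(f"recall={r:.2f} precision={p:.2f} F1={f1_score(p, r):.2f}")
```

Recall alone can be pushed up by marking everything relevant, which is why precision and F1 are usually reported alongside it.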
What Are the Barriers to Using Predictive Coding?
Despite Judge Peck’s opinion, there are a number of real-world barriers to using predictive coding on a regular basis:
- The underlying technology and math can appear complicated, and judges and lawyers may not jump at the chance to use predictive coding until they better understand the process and there is more guidance from the bench on when and how to use it. Although Judge Peck has stated that the requirements of Daubert do not apply to predictive coding, there remains a comfort level and learning curve that will probably take some time to achieve;
- There are not yet many studies establishing that predictive coding is better than human or keyword review, even though many believe that is the case. Further, “predictive coding” is just an umbrella term for
- There is not a clear process for how to approach the problem when one side wants to use predictive coding but is opposed by the other party. In Da Silva Moore v. Publicis, both parties agreed to use predictive coding and Judge Peck’s order addresses issues related to the protocol and process of how it is to be used in the case. However, in Kleen Products LLC v. Packaging Corp. of America, Magistrate Judge Nan Nolan will be focusing on that issue: can a party require computer-assisted review over the objection of another party? Stay tuned: Kleen Products has another hearing scheduled for April 2012.
What Are the Barriers to Not Using Predictive Coding?
Interestingly, the continued massive growth of data may ultimately force the use of predictive coding technologies. IDC projects that most companies will be dealing with 50 times as much data in 2020 as they had in 2011. People working in the eDiscovery industry also know that most cases today ignore many potentially relevant repositories of data, either by agreement or through lack of knowledge. If the additional repositories are included, along with the enormous projected growth of data, it’s likely that the amount of ESI in many cases will soon be too expensive for “normal” review by humans, or that such review will take too long.
In addition, while it may seem farfetched today, the requirements of proportionality could soon mandate that parties use predictive coding technologies to ensure that the litigation process remains just, speedy, and inexpensive.
What Is the Future?
With cases like Kleen Products on the near horizon, it seems likely that we will be receiving some judicial guidance on predictive coding over the next year. That’s certainly good news: the more guidance we receive, the more likely it is that lawyers and litigants will grow comfortable with the process.
In addition, predictive coding technologies show promise outside of the litigation process to help with our information management overload issues. Imagine training your email, fileshare or SharePoint archive, or even your records management system, to recognize and automatically classify information as it is received or created. “I’ll get milk on the way home” messages could be flagged for quick deletion, while employee reviews, contract modifications and other records could be stored according to your file classification plan. (This technology already exists; improvements, a higher comfort level and a better understanding of the technologies, brought about by their use in litigation, will help with the adoption rate).
For more, check out the references in “Dive Into Machine Classification and Coding”, part of my New Year’s Wish list.
Posted By: David in eDiscovery on March 13th, 2012.