The Future eDiscovery Arms Race: It is all about the Semantics

J. David Morris, EMC SourceOne eDiscovery - Kazeon

Over the last five years, a confluence of eDiscovery software, case law, information governance policies, and information technology integration has helped shape today’s eDiscovery market. As we look toward the future, we ask ourselves how to continue to optimize eDiscovery to intelligently reduce data volumes, efficiently decrease collection quantity, and streamline the review process, while also delivering the highest possible document accuracy, reliability, and repeatability. Finding relevant ESI is becoming more challenging for organizations as ESI volume increases and spreads across email systems, file shares, and laptops and desktops. To complicate matters further, attorneys are expanding discovery motions to include new ESI repositories, such as SharePoint and other collaborative tools, which further increase ESI volume and complicate identification and collection. So, how can the problem be addressed to balance the opposing constraints of ESI volume, eDiscovery expense, and relevant document precision and accuracy? The answer is simple, but will be challenging to implement. The future eDiscovery arms race is in the development of advanced, intelligent analytics capabilities. In other words, it is all about the semantics.

The first advance in reducing non-relevant ESI collection (or ESI culling) was simple file identification. Software, such as EMC’s Kazeon, delivered the capability to identify file types quickly and easily in order to exclude operating system files (e.g., CABs) and program executable files (e.g., those for Word, Excel, PowerPoint, Numbers, Keynote, etc.), which are resident on all computers and do not contain any relevant ESI. File identification technology was a quantum leap. It reduced collection volume by 50% to 60% compared with traditional brute-force forensic collection, which copies entire disk drives.

The second advance was Boolean keyword search, which has been a powerful eDiscovery tool. Over time, keyword search has become more sophisticated with the addition of spelling variants and root word variations. This increased keyword search accuracy by including common misspellings and root variants, like talk vs. talking. However, keyword search requires a priori knowledge of what one is looking for, which is problematic and limits its success. As valuable as it is, keyword search often includes many non-relevant documents (false positives) or excludes too many relevant documents (false negatives).

The complication lies in how we use language. There is the synonymy effect, in which two or more words in the same language have the same meaning (as in “student” and “pupil”), as well as the polysemy effect, in which an individual word has more than one meaning. The impact of polysemy on search complexity is as follows:

Polysemy is a major obstacle for all computer systems that attempt to deal with human language. In English, most frequently used terms have several common meanings. For example, the word fire can mean: a combustion activity; to terminate employment; to launch, or to excite (as in fire up). For the 200 most-polysemous terms in English, the typical verb has more than twelve common meanings, or senses. The typical noun from this set has more than eight common senses. For the 2000 most-polysemous terms in English, the typical verb has more than eight common senses and the typical noun has more than five[1].

The complexity of the English language limits our ability to search for and identify relevant information with efficiency, accuracy, and precision. Once other languages are added to the identification and search challenge, we also have to tackle their semantic differences, as well as the additional complexities of translating between languages.
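The false positive and false negative problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the document IDs, and the `keyword_search` helper are hypothetical and are not taken from any eDiscovery product.

```python
import re

# Hypothetical mini-corpus, invented for illustration only.
DOCS = {
    "doc1": "HR plans to fire the regional manager on Friday.",
    "doc2": "The fire alarm in building C was tested this morning.",
    "doc3": "We will terminate his employment at the end of the month.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(docs, all_of=(), any_of=()):
    """Boolean search: every term in all_of AND (if given) at least one term in any_of."""
    hits = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        has_all = all(term in tokens for term in all_of)
        has_any = not any_of or any(term in tokens for term in any_of)
        if has_all and has_any:
            hits.append(doc_id)
    return hits

# Search for dismissals using "fire" plus its root variants.
hits = keyword_search(DOCS, any_of=("fire", "fired", "firing"))
print(hits)
```

In this toy corpus the search returns doc1 and doc2. The hit on doc2 is a false positive caused by polysemy (a fire alarm, not a dismissal), and doc3 is a false negative caused by synonymy (it says "terminate his employment" and never uses the word "fire").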

What are the next steps in search and identification analytics technologies? There are nascent concept search capabilities in today’s market, which have been developed to circumvent the limitations of Boolean keyword search when dealing with large volumes of unstructured ESI. The idea is to develop the ability to search on an idea and retrieve responses that are relevant to the concept behind it. With synonymy and polysemy effects, an idea can be represented by numerous loosely related terms. Research in the following areas of concept search holds promise to increase search relevance and accuracy:

  1. Word Sense Disambiguation (WSD)[2]

WSD technologies help derive the actual meanings of words, and their underlying concepts, rather than simply matching character strings as keyword search technologies do. Research has progressed steadily to the point where WSD systems achieve sufficiently high levels of accuracy on a variety of word types and ambiguities.
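One classic way to approach WSD is to pick the sense whose description shares the most words with the surrounding context, in the spirit of the simplified Lesk algorithm. The sketch below is a minimal illustration of that idea, not a description of any product's implementation. The sense inventory for "fire", the stopword list, and the `disambiguate` helper are all hypothetical; a real system would draw its senses from a lexical resource.

```python
import re

# Hypothetical sense inventory for the word "fire": each sense is described
# by a few signature words. Invented for illustration only.
SENSES = {
    "combustion": "flames heat smoke burning combustion alarm building",
    "dismiss": "terminate employment dismiss job employee manager staff",
    "launch": "shoot launch weapon missile gun discharge",
}

STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "was", "is", "and", "off", "from"}

def tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(context, senses):
    """Return the sense whose signature overlaps most with the context, plus all scores."""
    ctx = tokens(context)
    scores = {sense: len(ctx & tokens(gloss)) for sense, gloss in senses.items()}
    return max(scores, key=scores.get), scores

sense_a, scores_a = disambiguate("The board voted to fire the manager and two other staff", SENSES)
sense_b, scores_b = disambiguate("Smoke from the fire set off the alarm in the building", SENSES)
print(sense_a, sense_b)
```

The same character string "fire" resolves to the dismissal sense in the first sentence and to the combustion sense in the second, which is exactly the distinction a plain keyword match cannot make.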

  2. Latent Semantic Analysis (LSA)[3]

LSA is a natural language processing technique that uses vectorial semantics (documents and queries are represented as vectors within a term-document matrix) to analyze the relationships between a set of documents and the terms they contain, and how those terms are correlated. After this analysis, LSA constructs a set of concepts related to the documents and the terms they contain. In other words, LSA searches documents for themes within the language usage and extracts the concepts that are common to the documents.
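To make the vector picture concrete, here is a minimal, dependency-free sketch of LSA on a hypothetical four-document corpus. LSA is normally computed with a truncated singular value decomposition from a numerical library; as a simplifying assumption, this sketch instead finds the two leading concepts by power iteration on the document-by-document Gram matrix, which yields the same document-side singular vectors for a tiny example like this one. The corpus and all helper names are invented for illustration.

```python
import math

# Hypothetical corpus, invented for illustration only.
DOCS = [
    "student exam",         # 0: education
    "pupil exam",           # 1: education
    "fire smoke alarm",     # 2: fire safety
    "fire alarm building",  # 3: fire safety
]
VOCAB = sorted({w for d in DOCS for w in d.split()})

def term_vector(text):
    """Raw term counts over the corpus vocabulary."""
    words = text.split()
    return [float(words.count(t)) for t in VOCAB]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def top_eigenpairs(gram, k, iters=200):
    """Leading k eigenpairs of a symmetric PSD matrix: power iteration plus deflation."""
    n = len(gram)
    g = [row[:] for row in gram]
    pairs = []
    for _concept in range(k):
        v = [1.0 - 0.1 * i for i in range(n)]  # fixed, generic start vector
        for _step in range(iters):
            w = [dot(row, v) for row in g]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in g])
        pairs.append((lam, v))
        # Remove the concept just found so the next pass finds the next one.
        g = [[g[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

doc_vectors = [term_vector(d) for d in DOCS]
gram = [[dot(a, b) for b in doc_vectors] for a in doc_vectors]
pairs = top_eigenpairs(gram, k=2)  # two latent concepts

def doc_coords(j):
    """Coordinates of document j in the two-concept space."""
    return [v[j] for _lam, v in pairs]

def fold_in(query):
    """Project a query into concept space (equivalent to Sigma^-1 U^T q)."""
    overlaps = [dot(term_vector(query), d) for d in doc_vectors]
    return [dot(overlaps, v) / lam for lam, v in pairs]

query = "pupil"
raw = cosine(term_vector(query), doc_vectors[0])      # plain term matching
lsa = cosine(fold_in(query), doc_coords(0))           # concept-space match
lsa_fire = cosine(fold_in(query), doc_coords(2))      # unrelated theme
print(raw, round(lsa, 3), round(lsa_fire, 3))
```

The query "pupil" shares no terms with the document "student exam", so plain term matching scores it zero. In the two-concept space the two land on the same education concept, because "student" and "pupil" both co-occur with "exam", while the fire-safety documents stay unrelated.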

  3. Local Co-Occurrence Statistics[4]

Local co-occurrence statistics is a technique that counts the number of times pairs of terms appear together (co-occur) within a given period, where a period is a predetermined window of terms or sentences within a document or set of documents.
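The counting step can be sketched as follows. This is a minimal illustration that assumes the window is measured in terms (tokens) rather than sentences; the `cooccurrence_counts` helper, the sample sentence, and the window size are hypothetical choices made for the example.

```python
import re
from collections import Counter

def cooccurrence_counts(text, window=3):
    """Count unordered term pairs that fall inside the same sliding window.

    A window of 3 means each term is paired with the 2 terms that follow it.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) are counted together.
                counts[tuple(sorted((left, right)))] += 1
    return counts

text = "the fire alarm rang and the fire crew arrived"
counts = cooccurrence_counts(text, window=3)
print(counts[("alarm", "fire")], counts[("fire", "the")], counts[("alarm", "crew")])
```

Pairs that co-occur often within the window, accumulated over a large collection, give a statistical signal of which terms are related. In practice, very common words such as "the" would usually be filtered out before counting.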

None of the above techniques is likely to be a complete solution to the eDiscovery concept search challenge on its own. However, combined and intelligently integrated within an overarching concept search paradigm, these methods will be a start in the right direction. As focus increases on conceptual search technologies, the winning products will likely be those with the best analytical technologies.


[1] http://en.wikipedia.org/wiki/Concept_Search#Auxiliary_Structures

[2] http://en.wikipedia.org/wiki/Word_Sense_Disambiguation

[3] http://en.wikipedia.org/wiki/Latent_semantic_analysis

[4] Bradford, R. B., Why LSI? Latent Semantic Indexing and Information Retrieval, White Paper, Content Analyst Company, LLC, 2008.

Comments

Comment from Dave Swider
Time August 26, 2010 at 08:56

Interesting analysis. I’m hopeful that we’ll eventually have a meet and confer that talks about concepts for collection, rather than keywords. I don’t see it happening until we have more sophistication among both the bench and attorneys, but I feel like it’s coming soon.