Defensible Data Disposition, Machine Learning, and the Cloud
- Bill Tolson |
- September 20, 2017 |
The concept of Defensible Disposition has been around for many years. Defensible Disposition is the process of disposing of unneeded and valueless information in a documented manner, with a record showing that the deleted data was not under regulatory retention requirements and was not subject to current or anticipated eDiscovery. In short, it is a data disposition process that ensures regulatory and legal considerations are taken into account.
Defensible Disposition: simple concept, complex process
Defensible Disposition is a relatively simple concept that in reality is difficult to implement. For example, it takes a large commitment from C-level management because of its cost and timeframe. Companies that actually start the process either hire consultants (expensive) or rely on employees (risky), which pulls those employees away from their standard work and can affect productivity across the company.
Hiring several consultants at several hundred dollars an hour is obviously expensive, because it takes time for them to come up to speed and they still rely on employees to “assist” in the project. Relying on employees alone can take even longer, because they must first learn the proper way to carry out the project.
Based on my experience, neither of these strategies is optimal, so many, if not most, companies decide to put the project off, hoping technology will somehow save them sometime in the future.
Can technology save the day?
Seven or so years ago, predictive coding began making a splash on the eDiscovery scene. Predictive coding uses machine learning algorithms to search large data sets and determine which data is responsive (or non-responsive) to an eDiscovery request. The computer is trained with examples of the kinds of data that could be relevant.
The predictive coding of several years ago used a “supervised” machine learning model. This is where a human provides the computer with examples of both relevant and non-relevant information and then runs a test cycle to see what the program got right and wrong. The human reviews the computer’s results and gives it feedback on its correct and incorrect calls. This training period can take 5, 10, 25, 50, or more training cycles.
In this way the computer “learns” what data is relevant to the eDiscovery order, so that searching, culling, and tagging huge collected data sets for potentially responsive (and/or privileged) information is sped up radically while accuracy also rises. Usually legal personnel (or data scientists) control the training process by specifying relevance criteria and performing the training cycles. Predictive coding (machine learning) can speed up discovery review and reduce its cost by 80 to 90%. Over the years, this form of machine learning has built a track record in the courts and become widely accepted.
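The supervised training cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a tiny naive Bayes text classifier written from scratch, not the algorithm of any particular predictive coding product, and the names `NaiveBayes` and `training_cycle` as well as the sample documents are hypothetical.

```python
import math
from collections import Counter

LABELS = ("responsive", "non-responsive")


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one human-labeled example to the model."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        """Return the label with the higher smoothed log-probability."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Add-one smoothing keeps unseen words from zeroing the score.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def training_cycle(model, sample, reviewer_labels):
    """One review cycle: the model predicts, the reviewer's labels are
    compared with the predictions, and every reviewed document is fed
    back into the model. Returns the number of wrong predictions."""
    errors = 0
    for text, truth in zip(sample, reviewer_labels):
        if model.predict(text) != truth:
            errors += 1
        model.train(text, truth)
    return errors


# Seed the model with one example of each kind, then run a review cycle.
model = NaiveBayes()
model.train("merger pricing agreement draft", "responsive")
model.train("lunch menu for friday", "non-responsive")
mistakes = training_cycle(
    model,
    ["draft merger terms", "friday team lunch"],
    ["responsive", "non-responsive"],
)
print("wrong predictions in this cycle:", mistakes)
```

In practice the cycle would be repeated on fresh samples until the error count stops improving, which is why the training period can run to dozens of cycles.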
Predictive coding and auto categorization of data
The obvious next step in utilizing machine learning was to automate data categorization while raising accuracy over that of manual categorization by individual employees. This is the capability information governance professionals have long been waiting for: to take the categorization of the huge amounts of data employees encounter every day out of their hands and automate it. The issue is that most employees don’t actually have time to categorize, correctly store, and apply retention/disposition policies to their data every day, which leads to the huge stores of unstructured data clogging up enterprise storage systems. In reality, machine learning-based categorization will produce more consistent and much more accurate results than manual categorization.
In January of 2015, Microsoft acquired Equivio, a provider of machine learning technologies for eDiscovery and information governance. Over the next couple of years, Microsoft embedded this machine learning technology into its Office 365 cloud platform under the E5 license, which offers predictive coding capability for discovery of Office 365 data.
This year, Microsoft incorporated this machine learning technology into its Azure Cloud platform to enable its Cognitive and Media Services capabilities. The exciting thing about this technology on the Azure platform is that vendors can now build Azure applications that utilize machine learning at a much lower cost.
Machine learning and defensible data disposition
The next logical step in using machine learning is to utilize auto-categorization to determine what data is valueless, a copy, or beyond its retention period, as the basis for defensible disposition. Again, defensible disposition is the process of disposing of data that is no longer needed for the running of the business, is not subject to regulatory retention, and is not subject to a current or anticipated legal hold.
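The three conditions in that definition amount to a simple decision rule, sketched below in Python. This is a minimal illustration under assumptions, not a description of any product: the `Record` fields, the `disposition_decision` and `dispose_batch` names, and the audit entry layout are all hypothetical, and a real system would take the hold and retention status from its legal and records management sources.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    record_id: str
    category: str                     # hypothetical label from auto-categorization
    retention_until: Optional[date]   # None means no regulatory retention applies
    on_legal_hold: bool               # current or anticipated hold
    has_business_value: bool


def disposition_decision(record, today):
    """Return (eligible, reason) for a single record."""
    if record.on_legal_hold:
        return False, "legal hold"
    if record.retention_until is not None and today <= record.retention_until:
        return False, "regulatory retention period still running"
    if record.has_business_value:
        return False, "still needed by the business"
    return True, "no hold, no retention obligation, no business value"


def dispose_batch(records, today):
    """Evaluate every record and return an audit trail of the decisions.
    The trail is what makes the disposition defensible later on."""
    audit_log = []
    for record in records:
        eligible, reason = disposition_decision(record, today)
        audit_log.append({
            "record_id": record.record_id,
            "category": record.category,
            "decided_on": today.isoformat(),
            "disposed": eligible,
            "reason": reason,
        })
    return audit_log


records = [
    Record("r1", "expired memo", date(2015, 1, 1), False, False),
    Record("r2", "contract", date(2025, 1, 1), False, True),
    Record("r3", "email", None, True, False),
]
for entry in dispose_batch(records, date(2017, 9, 20)):
    print(entry)
```

The hold check comes first on purpose, so that a record under legal hold is never disposed of even if its retention period has long expired.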
Machine learning for defensible disposition can be used in two ways: to categorize and dispose of the huge stockpiles of existing data around the enterprise, and to perform ongoing categorization and retention/disposition of live data, so that a buildup of unmanaged data never happens again.
Earlier in this article I mentioned that predictive coding for eDiscovery used a “supervised” machine learning model, meaning it relied on human interaction to train it. With the amount of information already sitting in enterprises, as well as the sheer volume of live data entering and leaving the enterprise, a supervised machine learning model would not be feasible.
The computer trains itself
For auto-categorization and defensible disposition to work, a self-learning or “unsupervised” machine learning model would need to be used.
In unsupervised machine learning, no human-labeled training data set and no training cycles are needed. Essentially, the program trains itself by finding patterns in the data set provided. Unsupervised machine learning opens the door to ongoing auto-categorization and defensible disposition of live data.
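To make the idea concrete, here is a minimal Python sketch of unsupervised grouping: documents are clustered purely by how many words they share, with no labeled examples. It is a toy, assuming a greedy single-pass approach and Jaccard word overlap; production systems would more likely use techniques such as k-means or topic modeling, and the `cluster` function, the 0.3 threshold, and the sample documents are hypothetical.

```python
def tokens(text):
    """Represent a document as the set of its lower-cased words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(documents, threshold=0.3):
    """Greedy single-pass clustering: each document joins the first
    cluster whose seed document is similar enough, otherwise it starts
    a new cluster. No labels or training cycles are involved."""
    clusters = []  # list of (seed_tokens, member_documents)
    for doc in documents:
        doc_tokens = tokens(doc)
        for seed_tokens, members in clusters:
            if jaccard(doc_tokens, seed_tokens) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append((doc_tokens, [doc]))
    return [members for _, members in clusters]


documents = [
    "invoice 1001 payment due",
    "invoice 1002 payment due",
    "holiday party invitation",
    "holiday party invitation reminder",
]
for group in cluster(documents):
    print(group)
```

Note that the groups come out unnamed. Something still has to map each group to a retention category before a disposition policy can be applied to it.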
The only caveat is that, for this to work, all corporate data must be stored and available centrally so the program can manage it. This means that all employee computers need to be synced or, in the case of laptops, need to upload their data to a central location on a regular basis. But the benefits far outweigh the cost of ignoring the problem. With predictive auto-categorization, the company addresses the problem of huge volumes of unmanaged employee data, which typically make up 80% of all data in an enterprise.
In the near future, unsupervised machine learning and auto-categorization will be the norm. The question is how expensive it will be…
The Azure Cloud and Archive2Azure
To make machine learning capabilities available to all at a low price, cloud platforms like Microsoft Azure will need to offer machine learning technology as an included service. In fact, Microsoft has already done this.
Archive360’s Archive2Azure is the first cloud-managed storage and archive solution for compliance and long-term data management built on Azure Cloud Services. Archive2Azure creates a highly secure, low-cost, legally compliant enterprise storage repository and archive, well suited to the storage and management of records, unstructured data, and legal data sets. Because it’s built on the Azure Cloud, and Microsoft’s machine learning technology is already available to all Azure application developers, auto-categorization and defensible disposition are just around the corner.
It could be available sooner than you expect, so begin moving your data to the Azure Cloud and Archive2Azure now.
Bill is the Vice President of Global Compliance for Archive360. Bill brings more than 29 years of experience with multinational corporations and technology start-ups, including 19-plus years in the archiving, information governance, and eDiscovery markets. Bill is a frequent speaker at legal and information governance industry events and has authored numerous eBooks, articles and blogs.