By: Bill Tolson on October 27th, 2016


Categorizing Grey Data: Part 2



As I discussed in Part 1 of this blog series, approximately 30% of your enterprise data is grey data: unstructured data that is low-touch or abandoned but that, for various reasons, your legal department is not willing to dispose of. This data can be content from departed employees, data that has aged beyond the standard retention periods but still needs to be retained due to extenuating circumstances, eDiscovery data sets from past cases, or content considered corporate history. The question is: how do you determine what is grey data versus truly valueless data?

First Tackle the Obvious

Referring back to the CGOC numbers from Part 1, you should be able to quickly determine what data is subject to legal hold, is subject to regulatory retention, or has obvious business value based on how your organization handles data generally. The difficult part is separating the grey data from the valueless data. In Part 1, I suggested that culling for valueless data is not the best strategy. Let me clarify: culling only for obviously valueless data is not a best practice.

To begin the culling process, first concentrate on those files that are obviously valueless, such as:

  • Duplicate files: There can be large numbers of duplicates in the file shares, document repositories, and PSTs spread around the enterprise.
  • Revisions: The final version of a document is often created from several earlier revisions. The revisions usually include structural changes, edits, added content, and comments. The question is: are the revisions important when determining value? In most cases the answer is no for aging files.
  • Aging backups: Backups of both desktops and servers/storage beyond a certain age are almost always valueless. Ask yourself the following question: what could I possibly do with an email system backup from seven years ago? In reality, backups are for disaster recovery purposes and should only be kept for short periods of time, e.g. 3 months; otherwise they become useless.
  • Aging system files and system reports: Again, what value does a system report from 3 years ago have?
  • Non-business-related or personal MP3s and video files: These files can take up large amounts of enterprise storage. Send an email to employees saying that they have two weeks to move these files off company assets, and that at the end of the two weeks all files matching these profiles will be deleted.

This is not an exhaustive list, but you get the idea: use common sense here.
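As an illustration of the duplicate-file check in the list above, here is a minimal Python sketch that groups byte-identical files under a directory by comparing file sizes first and SHA-256 hashes second. The function names (file_digest, find_duplicates) are hypothetical and are not part of any product mentioned in this post; a real file-share scan would also need to handle permissions, very large trees, and PST contents.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that have byte-identical content."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Second pass: hash only the files that share a size.
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling find_duplicates(Path("/some/share")) returns a list of groups; each group holds the paths that share identical content, so one copy per group can be kept and the rest reviewed for deletion.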

The Not So Obvious

The next step is to create a policy that determines, for the vast majority of unstructured data in the enterprise, which low-touch or grey data still rises to the level of retention. After disposing of the obvious, begin culling on other data points such as:

  • Last accessed date: If data is new or relatively new, it most likely belongs to current employees and still has a relatively high probability of review or reference (refer to the Lifecycle of Grey Data blog). It’s never a good strategy to delete relatively new content without the owner’s knowledge. Employees can waste huge amounts of time searching for a file they are sure they created just one month ago.
  • Target Custodians: Companies should develop a list of employees whose data will not be culled or deleted for any reason, for both legal and historical purposes; for example, the CEO, the GC, or specific engineers developing IP.
  • Departed Employees: Data from departed employees such as mailbox content, email archive content, file system content, cloud data, and data from their workstations should be collected and held for a period of time as defined by corporate legal. This data can be instrumental if wrongful termination lawsuits are later filed. This data is more easily collected while the employee is leaving the company or shortly after.
  • Author-less Content: On rare occasions, data files will not have an easily discernible author. In this event, keyword filtering can help determine content value.
  • PSTs: Again, ownership of PSTs can sometimes be difficult to determine. Cracking open the PST (if it’s not password protected) can help you quickly establish ownership.

The above bullets are the most productive culling points, but many others may exist depending on your specific industry.
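The culling points above can be sketched as a simple triage rule. The following Python sketch is only an assumption about how such a policy might be encoded: the FileRecord fields, the category labels, and the 365-day "recent" window are all hypothetical, and your legal department's policy should drive the real values and the order of the checks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str
    owner: str               # empty string when no author can be determined
    last_accessed: datetime


def triage(
    record: FileRecord,
    now: datetime,
    protected_owners: set[str],
    departed_owners: set[str],
    recent_window: timedelta = timedelta(days=365),  # assumed threshold
) -> str:
    """Return a hypothetical handling category for one file."""
    if record.owner in protected_owners:
        return "retain"          # target custodians: never culled
    if record.owner in departed_owners:
        return "hold"            # held for the period legal defines
    if not record.owner:
        return "review"          # author-less: needs keyword filtering
    if now - record.last_accessed < recent_window:
        return "keep-in-place"   # recently used by a current employee
    return "grey-candidate"      # low-touch data to evaluate for archiving
```

The custodian and departed-employee checks run before the last-accessed check on purpose, so that a protected custodian's old files are never classified as grey candidates by age alone. Note that last-accessed timestamps are not always reliable on every file system, so treat them as one signal among several.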

Next Steps After Categorization – Store It

So what should you do with this grey data after you have finished the filtering and categorization? Obviously, you began the process in order to save it. The questions are: for how long, and where?

You should develop a policy for handling grey data. First, create high-water-mark retention periods, for example, the statute of limitations for employee wrongful termination lawsuits in your jurisdiction.
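One way to read "high-water mark" is to take the longest of all the retention periods that apply to a category of grey data and compute the disposition date from it. The short Python sketch below assumes that reading; the function names are hypothetical and any period values you pass in are placeholders, not legal guidance.

```python
from datetime import date


def high_water_mark_years(periods: dict[str, int]) -> int:
    """Return the longest applicable retention period, in years."""
    return max(periods.values())


def disposition_date(start: date, years: int) -> date:
    """Return the date on which data retained from start may be disposed of."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is Feb 29 and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)
```

For example, with placeholder periods of 3 and 7 years, the high-water mark is 7 years, and data retained from an employee's departure date would become eligible for disposition on the same calendar date 7 years later.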

Second, establish a secure, low-cost repository that can be managed and searched when needed. This repository should also include in-place legal hold and retention/disposition functionality so that this grey data can eventually be disposed of.

Microsoft Azure as the Managed Grey Data Repository

Archive2Azure is Archive360’s Compliance Storage Solution targeting long-term storage and management of unstructured grey data in the Microsoft Azure platform. The Archive2Azure solution leverages Microsoft Azure’s low-cost ‘cool’ storage as an alternative to expensive on-premises enterprise storage. Azure costs as little as $0.02 per GB per month and eliminates the expensive overhead costs of traditional on-premises storage.
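To put the per-GB figure in perspective, here is a rough Python sketch of the storage arithmetic. It assumes the $0.02 per GB per month rate quoted above and 1 TB = 1,024 GB, and it covers the storage charge only; actual pricing varies by region and over time, and any other charges are not modeled. The function name is hypothetical.

```python
from decimal import Decimal

RATE_PER_GB_MONTH = Decimal("0.02")  # figure quoted in this post; an assumption


def monthly_storage_cost(terabytes: int, rate: Decimal = RATE_PER_GB_MONTH) -> Decimal:
    """Return the monthly storage charge in dollars for the given volume."""
    gigabytes = terabytes * 1024
    return rate * gigabytes


monthly = monthly_storage_cost(10)   # 10 TB of grey data
annual = monthly * 12
print(f"10 TB: ${monthly} per month, ${annual} per year")
# prints: 10 TB: $204.80 per month, $2457.60 per year
```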

Importantly, Archive2Azure provides automated retention, on-demand indexing, encryption, search, review, and production, all important components of a low-cost, searchable storage solution. Given the clear cost advantages of the Azure cloud, it’s no surprise many companies are looking to Azure and Archive2Azure for grey data management and storage.


About Bill Tolson

Bill is the Vice President of Global Compliance for Archive360. Bill brings more than 29 years of experience with multinational corporations and technology start-ups, including 19-plus years in the archiving, information governance, and eDiscovery markets. Bill is a frequent speaker at legal and information governance industry events and has authored numerous eBooks, articles and blogs.