In the previous blog post, I discussed how common practices from the “old” world of Enterprise Content Management (ECM) can be applied in the “new” Big Data world. Applying these practices in five areas makes a Big Data environment easier to govern: (1) Information Life Cycle, (2) Privacy, (3) Pre-processing, (4) Preservation, and (5) Versioning.

In this blog post I’ll start by discussing Information Life Cycles and how retention schemes will help to control data growth by purging data that is no longer of value.

1. Information Life Cycle

Every full-blown ECM system should have document life cycles in place. In my experience, ECM systems without life cycle management eventually suffer from data growth issues and become non-compliant. Life cycle management defines when data can or must be discarded. Before that point, life cycle policies can redefine the access rules for documents, e.g. when documents are to be declassified, or prescribe how documents have to be stored: in what format and on what media. In my opinion, all data in a Big Data environment should have life cycle policies imposed; otherwise, governing the data will become cumbersome, if not impossible. In another blog post I’ve described why retention policies are necessary. In any case, defining when data objects can be deleted tackles data growth issues and helps in staying compliant, for instance where privacy-related data is concerned.

Document management offers another approach. Records Management states which documents you have to keep for a certain amount of time, mostly for legal reasons. These documents, which can be paper, reports, e-mails, movies, computer files and so on, should be kept in such a way that neither the contents nor the metadata can be modified. Where retention describes what you can delete, Records Management describes what you have to keep. In a Big Data environment, I hope both methods are in place, because they are complementary.
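To make the interplay between the two methods concrete, here is a minimal sketch of how a "must keep" period and a "may delete" period could be evaluated together for a single data object. The class name, field names and return values are my own assumptions for illustration; they are not taken from any particular ECM or Big Data product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LifeCyclePolicy:
    # Hypothetical policy fields, both in days counted from the ingest date.
    keep_at_least_days: Optional[int] = None  # Records Management: must keep
    delete_after_days: Optional[int] = None   # retention: may or must discard


def disposition(ingested: date, today: date, policy: LifeCyclePolicy) -> str:
    """Return 'keep', 'eligible_for_deletion' or 'conflict' for one object."""
    keep = policy.keep_at_least_days
    delete = policy.delete_after_days

    # A policy that demands deletion before the mandatory keep period ends
    # cannot be satisfied and needs a human decision.
    if keep is not None and delete is not None and delete < keep:
        return "conflict"

    age_in_days = (today - ingested).days

    # The "must keep" rule wins while the object is still inside its period.
    if keep is not None and age_in_days < keep:
        return "keep"

    # After that, the retention rule decides whether the object may go.
    if delete is not None and age_in_days >= delete:
        return "eligible_for_deletion"

    # Without an applicable deletion rule, the safe default is to keep.
    return "keep"
```

For example, with a policy of "keep at least 7 years, delete after 10 years", an object that is 2 years old is kept and an object that is 11 years old becomes eligible for deletion. The point of the sketch is that both rules live in one policy, so contradictions surface as an explicit conflict instead of as silent data loss.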

Another long-running discussion in ECM is about storing data immutably. Immutable storage helps with legal use cases, because you can prove that the data hasn’t been tampered with. Printing reports of snapshots of the databases might do, but it isn’t very sustainable. When your Big Data environment can store data immutably in a proven way, you have at least ensured that the data is unchanged from the moment of ingest. If your data may be used in legal cases, consider this option.
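One common building block for showing that data is unchanged since ingest is a cryptographic fingerprint recorded at ingest time. The sketch below uses SHA-256 from Python's standard library; the function names are hypothetical. Note that this only detects changes and does not make the storage itself immutable, and it only proves anything if the recorded fingerprint is kept somewhere the data owner cannot quietly alter.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Compute a SHA-256 digest to record at the moment of ingest."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, recorded_fingerprint: str) -> bool:
    """Check the current data against the fingerprint recorded at ingest."""
    return fingerprint(data) == recorded_fingerprint


# At ingest: store the fingerprint separately from the data itself.
original = b"quarterly report, version as received"
recorded = fingerprint(original)

# Later: any change to the content yields a different fingerprint.
tampered = b"quarterly report, version as received (edited)"
unchanged_ok = is_unmodified(original, recorded)
tampered_ok = is_unmodified(tampered, recorded)
```

Here `unchanged_ok` is true and `tampered_ok` is false, which is exactly the kind of evidence that printed snapshots try to provide, but in a form that scales.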

Storage keeps getting cheaper and more abundant, which makes life cycle policies for managing physical storage seem less pressing. Adding storage is nowadays the most common way of tackling data growth. But where data grows exponentially, storage prices don’t drop at the same rate, so future storage costs will rise. We can use retention to save storage space, but we do need retention to remain compliant and to destroy data we aren’t allowed to keep. And in my opinion, it doesn’t make sense to store data without any business purpose, or data that we can no longer interpret because its meaning and context have been lost in time.
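The argument about growth versus price can be checked with simple arithmetic: if data volume grows by a fixed percentage per year and the price per terabyte falls by a fixed percentage per year, total cost rises whenever the combined yearly factor is above 1. The sketch below works this out. All numbers are made up for illustration and are not measurements from any real environment.

```python
def projected_storage_cost(volume_tb, price_per_tb, growth_rate,
                           price_decline, years):
    """Return the yearly total storage cost, starting with year 0.

    growth_rate and price_decline are fractions per year,
    e.g. 0.40 for 40% growth and 0.20 for a 20% price drop.
    """
    costs = []
    for year in range(years + 1):
        volume = volume_tb * (1 + growth_rate) ** year
        price = price_per_tb * (1 - price_decline) ** year
        costs.append(volume * price)
    return costs


# Illustrative assumptions: 100 TB today at 20 per TB, data growing 40% per
# year while prices fall 20% per year. Combined factor: 1.4 * 0.8 = 1.12,
# so the total bill still grows by about 12% every year.
rising = projected_storage_cost(100, 20, 0.40, 0.20, years=5)

# Only when growth is slower than the price drop does the bill shrink:
# 1.1 * 0.8 = 0.88, about 12% less every year.
falling = projected_storage_cost(100, 20, 0.10, 0.20, years=5)
```

Under these assumptions, cheaper storage alone does not stop costs from rising; reducing the volume through retention does.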

At a large manufacturing company, the discussion came up about what to do with “old” data from retired applications. Retiring applications implies migrating functionality and data to a more modern system. The questions that arose were: what portion of the data do we move to the new system, can we discard data that is no longer of use, or should we keep this data for some unforeseen future use? These questions couldn’t be answered because the company had no insight into its data life cycles. It was unable to assess the risks and costs of keeping “old” data: does the data contain information we need to discard, or will we need the data in the future? For what use? What are the risks when we cannot find information, and what are the risks of data breaches?

Thinking about and deciding on Information Life Cycles is essential for managing your data in an orderly manner, based on your business requirements.

In the next blog post, I’ll discuss the next two topics.

Photo CC BY-SA 2.0 by Michael Mandiberg via Flickr