In the first blog post, I discussed how common practices from the “old” world of Enterprise Content Management can be applied in the “new” Big Data world. Using these practices will create a more governable Big Data environment.
In the previous blog post I discussed Information Life Cycles and privacy, two topics that need attention to keep your data set healthy and compliant. Today I’ll be discussing preservation and versioning. These topics are of interest when you want to keep your data meaningful and accessible in the long run.
When your data has a life span of more than a couple of years, data preservation becomes essential. Preservation is an archival term, denoting the actions you should take to keep documents accessible and readable over time. In some cases, even forever. But data, contrary to paper, has accessibility issues. When data is stored in a format you cannot interpret anymore, it becomes worthless. And how many proprietary database formats have we seen in the last couple of decades? Are you still capable of viewing that information? And moreover, are you still capable of interpreting the information for use in reporting and insights?
In the ECM world, special precautions are defined that help create an environment in which data can be stored over long periods of time. Several measures can be taken to support preservation:
- Reformatting the data into a more durable format, like (standardized) XML.
- Replacing deteriorating media, like CDs, tape, film or even paper.
- Migrating during the life cycle, e.g. from one cloud provider to another.
For this blog post, the first bullet point is the important one: reformatting the data into a more sustainable format. For documents this can be PDF/A (ISO 19005), for databases this can be SIARD (eCH-0165). Information can get lost when reformatting data, but is that a small price to pay for preserving it? Maybe it’s a good idea to keep both formats: the raw one and a preserved copy, as two versions of the same data.
When you want to put preservation in place, you’ll have to know things about your raw data. In order to reformat, you’ll have to know the format of the raw data. Your storage system should know about the different versions or formats it stores, and which one to use. More about versioning in the next topic. You should also decide when to reformat: at some stage in the data life cycle, or immediately at ingest? When you do it during the life cycle, most of the data probably won’t need to be reformatted at all, because its life span is too short. When you reformat at ingest, you’re basically pre-processing the data into a limited set of formats and models, making data analytics easier.
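To make the “reformat at ingest, but keep the raw copy” idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `StoredObject` class, the `ingest` and `csv_to_xml` functions, and the simple `<records>` XML layout are made up for this post and are not part of any standard or product. Real preservation formats such as SIARD are far richer than this.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One logical data set, held in one or more formats (hypothetical model)."""
    name: str
    # Maps a format label (e.g. "csv", "xml") to the bytes stored in that format.
    renditions: dict = field(default_factory=dict)
    # The format that readers should use by default.
    preferred: str = ""


def csv_to_xml(raw: bytes) -> bytes:
    """Reformat UTF-8 CSV bytes (with a header row) into a simple XML document."""
    rows = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Column names go into an attribute, because they are not
            # guaranteed to be valid XML element names.
            ET.SubElement(record, "field", name=column).text = value
    return ET.tostring(root, encoding="utf-8")


def ingest(name: str, raw: bytes, raw_format: str) -> StoredObject:
    """Store the raw bytes and, where a converter exists, a preserved copy."""
    obj = StoredObject(name=name, renditions={raw_format: raw}, preferred=raw_format)
    if raw_format == "csv":
        obj.renditions["xml"] = csv_to_xml(raw)
        obj.preferred = "xml"
    return obj
```

Calling `ingest("customers", raw_bytes, "csv")` would give you an object that holds both the untouched CSV and the XML copy, with the XML marked as the one to use. Formats without a converter are simply stored as they are, which matches the idea that short-lived data may never need reformatting.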
Some Hadoop-based systems, like HBase, have versioning of data. Some other systems have versioning on metadata. Most ECM-systems have versioning of documents, some also for metadata. The Wikipedia entry on version control makes a statement that brings versioning into focus:
“The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.”
Versioning allows users to view the history of the data. In the case of ECM, the version history allows users to see how a document has evolved through its life. This history lets you gain insight into how certain decisions or policies were made. It also shows what information was available at a certain point in time. Some source databases keep historical information by themselves, but some do not. And some systems keep only limited historical information. Take, for example, the home addresses of your customers. Maybe your CRM database only keeps the current and possibly one previous address. That might be sufficient to support your day-to-day processes: sending information, bills, parcels, orders. But this information is too incomplete to tell you how often a customer moves. And maybe a customer who moves a few times per year has problems paying their bills? Just a hypothesis…
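The address example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `AddressVersion` record and the two helper functions are hypothetical, and a real system would get this history from its versioned store rather than from a list in memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AddressVersion:
    """One version of a customer's home address (hypothetical record layout)."""
    customer_id: str
    address: str
    valid_from: date


def address_as_of(history, customer_id, when):
    """Return the address that was current for the customer on `when`, or None."""
    candidates = [
        v for v in history
        if v.customer_id == customer_id and v.valid_from <= when
    ]
    if not candidates:
        return None
    # The most recent version that had already started on that date wins.
    return max(candidates, key=lambda v: v.valid_from).address


def moves_per_customer(history):
    """Count address changes per customer: number of versions minus the first one."""
    counts = {}
    for version in history:
        counts[version.customer_id] = counts.get(version.customer_id, 0) + 1
    return {customer_id: n - 1 for customer_id, n in counts.items()}
```

With the full version history available, `moves_per_customer` answers the “how often does this customer move?” question, and `address_as_of` shows what you knew at a certain point in time. Neither is possible when only the current and one previous address are kept.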
For me, versioning of data is essential. Then again, how essential depends on your business and your reporting needs. In retail the need may be small, but in insurance and pharmacy the need for versioning will be large. In all cases, though, you have to think about the value of the historical information you keep in your old versions. Big data is, after all, business driven.
In this series of blog posts I’ve discussed how typical Enterprise Content Management topics can have an impact on your Big Data architecture. I’ve discussed five well-known concepts from the “old” ECM world which can be applied in the “new” Big Data world: (1) Information Life Cycle, (2) Privacy, (3) Pre-processing, (4) Preservation, and (5) Versioning. There are certainly more topics that would have been interesting to discuss. But considering these five when creating a Big Data environment will certainly help to build a system that is governable in the long run.