Insights & Data Blog

Opinions expressed on this blog reflect the writer’s views and not the position of the Capgemini Group

5 Things Big Data can learn from ECM
(part 4 of 4)

In the first blog post of this series, I discussed how common practices from the “old” world of Enterprise Content Management can be applied in the “new” Big Data world. Using these practices will create a Big Data environment that is easier to govern.

In the previous blog post I discussed Information Life Cycles and privacy, two topics that need attention to keep your data set healthy and compliant. Today I’ll be discussing preservation and versioning, topics that matter when you want to keep your data meaningful and accessible in the long run.

4. Preservation

When your data has a life span of more than a couple of years, data preservation becomes essential. Preservation is an archival term denoting the actions you should take to keep documents accessible and readable over time, in some cases even forever. But digital data, unlike paper, has accessibility issues: when data is stored in a format you can no longer interpret, it becomes worthless. And how many proprietary database formats have we seen in the last couple of decades? Are you still capable of viewing that information? And moreover, are you still capable of interpreting it for use in reporting and insights?

In the ECM world, special precautions have been defined to help create an environment that can store data over long periods of time. Typical preservation measures include:

  • Reformatting the data into a more durable format, like (standardized) XML.
  • Replacing deteriorating media, like CDs, tape, film or even paper.
  • Migrating during the life cycle, e.g. from one cloud provider to another.

For this blog post, the first bullet point is the important one: reformatting the data into a more durable format. For documents this can be PDF/A (ISO 19005), for databases this can be SIARD (eCH‑0165). Information can get lost when reformatting data, but that may be a small price to pay for preserving it. So maybe it’s a good idea to keep both: the raw data and a preserved copy, as two versions of the same data.

When you want to put preservation in place, you’ll have to know things about your raw data. In order to reformat, you’ll have to know the format of the raw data. Your storage system should know about the different versions and formats it stores, and which one to use; more about versioning in the next topic. You should also decide when to reformat: at some stage in the data life cycle, or immediately at ingest? When you do it during the life cycle, most of the data probably shouldn’t be reformatted at all, because its life span is too short. When you reformat at ingest, you’re basically pre-processing the data into a limited set of formats and models, making data analytics easier.
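To make that concrete, here is a minimal Python sketch of reformat-at-ingest. It keeps the raw bytes untouched next to a preserved copy and records format metadata for both versions. The conversion step is a placeholder, and all names and the target format are illustrative assumptions, not a prescribed implementation:

```python
import hashlib
import json
import pathlib

def to_durable(raw: bytes) -> bytes:
    """Placeholder conversion step: a real pipeline would reformat the
    raw bytes into the chosen durable format (e.g. standardized XML,
    PDF/A for documents, SIARD for database content)."""
    return raw

def ingest(raw_path: str, store_dir: str, target_format: str = "xml") -> dict:
    """Store the raw file untouched plus a preserved copy, with metadata
    recording which format each version is in."""
    raw = pathlib.Path(raw_path).read_bytes()
    store = pathlib.Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)

    # A content hash doubles as a stable identifier linking both versions.
    digest = hashlib.sha256(raw).hexdigest()

    (store / f"{digest}.raw").write_bytes(raw)
    (store / f"{digest}.preserved").write_bytes(to_durable(raw))

    metadata = {
        "sha256": digest,
        "raw_format": pathlib.Path(raw_path).suffix.lstrip("."),
        "preserved_format": target_format,
    }
    (store / f"{digest}.meta.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```

The point of the sketch is the shape, not the details: both versions are kept, linked by an identifier, and the metadata tells the storage system which format each copy is in.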

5. Versioning

Some Hadoop-based systems, like HBase, have versioning of data. Some other systems have versioning on metadata. Most ECM systems have versioning of documents, and some also version metadata. The Wikipedia entry on version control makes a statement that brings versioning into focus:

“The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.”

Versioning allows users to view the history of the data. In the case of ECM, the version history lets users see how a document has evolved through its life. This history allows you to gain insights into how certain decisions or policies came about, and into what information was available at a certain point in time. Some source databases keep historical information by themselves, but some do not, and some keep only limited historical information. Take, for example, the home addresses of your customers. Maybe your CRM database only keeps the current address and possibly one previous address. That might be sufficient to support your day-to-day processes: sending information, bills, parcels, orders. But it is incomplete if you want to know how often a customer moves. And maybe a customer who moves a few times per year has problems paying his bills? Just a hypothesis…
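As an illustration of HBase’s built-in cell versioning mentioned above, here is a short, hypothetical Python sketch using the happybase client. The host and table names are made up, and it assumes the ‘addr’ column family was created with a VERSIONS limit greater than one:

```python
import happybase

# Connect to a (hypothetical) HBase Thrift server; the 'customer' table
# is assumed to exist with a column family 'addr' created as VERSIONS => 10.
connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('customer')

# Each move is simply a new put on the same cell; HBase keeps the older
# values as versions, up to the column family's VERSIONS limit.
table.put(b'cust-42', {b'addr:home': b'12 Canal Street, Amsterdam'})

# Read the full version history (newest first) with timestamps.
history = table.cells(b'cust-42', b'addr:home',
                      versions=10, include_timestamp=True)
for value, timestamp in history:
    print(timestamp, value.decode('utf-8'))

# The number of versions within a given period approximates how often
# the customer moved -- the kind of insight the CRM example describes.
```

Note the design choice this implies: the version limit is fixed per column family at table-creation time, so you have to decide up front how much history is worth keeping.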

For me, versioning of data is essential. Admittedly, it all depends on your business and your reporting needs: in retail the need may be small, but in insurance and pharma the need for versioning will be large. In all cases, though, you have to think about the value of the historical information you keep in your old versions. Big Data is, after all, business driven.

Conclusion

In this series of blog posts I’ve discussed how typical Enterprise Content Management topics can have an impact on your Big Data architecture. I’ve covered five well-known concepts from the “old” ECM world that can be applied in the “new” Big Data world: (1) Information Life Cycle, (2) Privacy, (3) Pre-processing, (4) Preservation, and (5) Versioning. There are certainly more topics that would have been interesting to discuss, but considering these five when creating a Big Data environment will certainly help you build a system that is governable in the long run.

About the author

Reinoud Kaasschieter
3 Comments
I'm wondering about the use of historical big data. Can you tell a bit more about the versioning of metadata? To which data are you referring? How do you know when you should use a new version? Are there systems that will help the user with that? I also wonder about the data itself: how would you know if the meaning of data changes? Say you have data spanning 10 years and occupation is one of the fields. If you have a record from 10 years ago and a record from this year with the same occupation in that field, will those really be the same? If not, what would be a way to tackle a problem like that (any help from the system)? If you have an old record with an address in a city, will that address still be in the same city today (in the Netherlands city borders may vary over time)? So you may have to be careful how you use that data. Could you write something about these kinds of problems?
Hi Job, interesting questions I don't have ready-made answers for. The easiest way forward is to create a new version of the data when re-ingesting the same data set. This preserves the history of the data, and you can derive historical trends from it. That's all I wanted to point out. When a data set, or part of a data set, doesn't change over time, this may cause some redundancy of data. But knowing that data didn't change is also an insight. The other questions you pose are related to archival or preservation issues around data and metadata. To put it bluntly, when you suspect the meaning of metadata will change during the life span of the data, you should document the meaning of that metadata, up to a level at which the ontology doesn't change. For your city border case, you could use maps or other geospatial information to document the municipalities in the Netherlands, assuming that maps will be interpreted the same way over time. This is what some experts call metametadata: metadata about metadata. The problem of changing and maintaining ontologies is, however, a known problem and is being worked on, e.g.: http://www.ibmt.fraunhofer.de/content/dam/ibmt/en/documents/PDFs/ibmt-product-information-sheets/telematics-intelligent-health-systems/MT_ENSURE_en.pdf And I haven't even discussed what you should do when the meaning of metadata changes: yes, in that case you should version the metadata too.
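To illustrate the version-on-re-ingest idea from this reply, here is a minimal, hypothetical Python sketch that uses a content hash to decide whether a re-ingested data set actually changed; the catalog structure and names are illustrative only:

```python
import hashlib
from datetime import datetime, timezone

# In-memory version catalog; a real system would persist this.
versions: list[dict] = []

def reingest(dataset_bytes: bytes) -> dict:
    """Record a re-ingest: bump the version only when the content hash
    changed. An 'unchanged' entry is recorded too, because knowing the
    data didn't change is itself an insight."""
    digest = hashlib.sha256(dataset_bytes).hexdigest()
    changed = not versions or versions[-1]["sha256"] != digest
    last_version = versions[-1]["version"] if versions else 0
    entry = {
        "version": last_version + 1 if changed else last_version,
        "sha256": digest,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "changed": changed,
    }
    versions.append(entry)
    return entry
```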
Reinoud, Thanks. That link is also very interesting.
