One of the essential principles of Big Data is storing data in raw format: no transformation during ingest. The data is interpreted and transformed only when it is read and analyzed. Though this principle has its virtues, it also has drawbacks. With years of experience in Enterprise Content Management (ECM), dealing with all kinds of data, I would like to discuss five topics where this principle of Big Data needs some adjustment.

But let’s start by describing what Enterprise Content Management (ECM) is: “ECM is an umbrella term covering document management, web content management, search, collaboration, records management, Digital Asset Management (DAM), workflow management, capture and scanning. ECM is primarily aimed at managing the life-cycle of information from initial publication or creation all the way through archival and eventually disposal.” (source: Wikipedia) In my opinion, most ECM systems deal with unstructured data: documents, e-mails, blogs, tweets, sound files, movies and so on. We’ll call all those objects “documents”, though I’m well aware that these data sources aren’t documents in the conventional sense.

Enterprise Content Management (ECM) has been around for over three decades now. In the 1980s and 1990s, ECM systems were specifically designed to store large quantities of data, up to 33 TB in specially designed jukeboxes for optical platters such as CDs and DVDs. The more advanced ECM systems were built to cope with these large quantities of data: managing it, and keeping it accessible and meaningful.

Nowadays, storing large quantities of data is no longer an issue; storage is cheap. The difference between structured data (mostly data in tabular form) and unstructured data (data not organized in a pre-defined manner) is blurring. In a Big Data environment, distinguishing between structured and unstructured is no longer useful: it’s all data. ECM and Business Intelligence share a common aim, creating value from information, whether in a transactional or an analytical manner. Formatting is no longer a major issue; deriving meaning from information, and interpreting it, should be your focus.

Content analytics can automatically extract meaning from written text, and machine-learning systems such as IBM Watson can interpret unstructured data. Combine these with traditional structured data, such as databases, and all data in a Big Data environment can be processed and analyzed in the same way. This is powerful, even more so when you realize that up to 80% of all data in the world is unstructured.
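As a simplified illustration of what “extracting structure from text” means, here is a toy sketch (not how Watson or any real content-analytics engine works) that turns a document into a structured record of term counts, something that could sit next to conventional tabular data:

```python
import re
from collections import Counter

def extract_features(text, vocabulary):
    """Turn unstructured text into a structured record:
    a count per term of interest. A toy stand-in for real
    content analytics (entity extraction, classification, etc.)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    return {term: counts.get(term, 0) for term in vocabulary}

# Unstructured source: an e-mail body (illustrative example)
email = "Please archive the invoice. The invoice total is due Friday."
# Structured side: terms we track in a conventional table
vocab = ["archive", "contract", "invoice"]

record = extract_features(email, vocab)
print(record)  # prints {'archive': 1, 'contract': 0, 'invoice': 2}
```

Once unstructured sources are reduced to records like this, they can be joined, filtered and aggregated with the same tools you already use for structured data.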

Modern Big Data systems can process structured and unstructured data without problems. The Big Data principle says to just store the data; you can interpret and process it later, when you need insights. But is storing data as-is, without any additional measures, really the desired method of ingesting it? During my career in ECM, I’ve learned that bluntly storing raw data or documents is not always the best way forward. Sometimes additional measures must be taken to fulfill other requirements, such as legal obligations, keeping data accessible and governing the data itself.
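To make “additional measures at ingest” concrete, here is a minimal sketch of capturing governance metadata alongside the untouched raw data. The field names and the retention policy are illustrative assumptions, not a prescription:

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest(raw, source, retention_years):
    """Store the data as-is, but record governance metadata at
    ingest time -- the kind of measure a raw-only ingest skips.
    All field names here are illustrative."""
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),  # integrity check for later audits
        "source": source,                           # provenance
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "retention_years": retention_years,         # drives eventual disposal
        "size_bytes": len(raw),
    }

meta = ingest(b"raw document bytes", source="mail-archive", retention_years=7)
print(json.dumps(meta, indent=2))
```

The raw bytes stay exactly as they arrived; the metadata record is what makes them findable, auditable and disposable years later.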

In this series of blog posts, I would like to discuss five well-known concepts from the “old” Enterprise Content Management world that can be applied in the “new” Big Data world. Considering these topics when creating a Big Data environment will certainly help you build a system that is governable in the long run. If you’ve already applied governance in Big Data environments, you’ll probably be familiar with these topics. But for anyone who’s new to the field, it’s always good to learn from the common practices of others.

I’ll discuss five topics in the next three blog posts:

  1. Information Life Cycle
  2. Privacy
  3. Pre-processing
  4. Preservation
  5. Versioning

In my next blog post, I’ll start off with Information Life Cycle, a term that combines both Data Life Cycle and Document Life Cycle.

Photo CC BY-SA 2.0 by Michael Mandiberg via Flickr