“Time is made up of captured moments and things shared…” (unknown)

Capgemini’s recent study, in collaboration with Emc²: “Big & Fast Data: The Rise of Insight-Driven Business”, concludes that Big Data is here and here to stay. Getting the most out of Big Data is the new challenge. But before we can use Big Data, we have to capture it.

Big Data is, of course, all about large amounts of data to be captured, analyzed and stored. As Wikipedia defines it: “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.”

Ever growing amounts of unstructured data

There are several sources of Big Data which generate a lot of new data. I want to focus on the ever growing amounts of unstructured data, now mostly generated through the internet and social media. Text messages, tweets and (blog) posts are increasingly becoming an important source of data relevant for your organization. They reflect the sentiment about your image and products, they’re a source for spotting new trends and innovations, and they contain valuable competitor information. Capturing unstructured data, traditionally documents and email messages, has been the territory of Enterprise Content Management. Document capture software has been around for decades. But can those systems cope with the increasing amounts of data and the growing number of sources in a Big Data environment?

When we focus on the first of the Big Data challenges, the capture of data, the level of challenge is mixed. Sucking in large amounts of data is no real problem in principle. That being said, it can still pose some major technical issues. But I don’t want to focus on those technical issues here.

The issues I want to talk about are what we do with the data we capture, the data from sources we tap into. What data do we keep, and what data do we discard as redundant, obsolete, trivial or irrelevant? It’s all about keeping your collection of Big Data fit for purpose, now and in the future. This also implies that you know what you store. So one of the tasks of capturing is filtering the data to keep only the relevant information. Though some authors think that storing all data is a viable strategy, I am of the opinion that selecting data is essential to keep your datasets governable and healthy. Selecting data starts at the gate: the capture process.
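To make the idea of selecting at the gate a bit more concrete, here is a minimal Python sketch of a capture filter. It is only an illustration under my own assumptions: the `Message` type, the `capture_filter` function, the keyword list and the thresholds (30 days, 3 words) are all hypothetical, and a real capture process would need far more refined rules for what counts as redundant, obsolete, trivial or irrelevant.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    """Hypothetical captured item, e.g. a tweet or a blog comment."""
    text: str
    created: datetime


def capture_filter(messages, keywords, now,
                   max_age=timedelta(days=30), min_words=3):
    """Yield only the messages worth storing.

    All rules and thresholds are illustrative assumptions:
    - redundant:  same text (ignoring case and spacing) already kept
    - obsolete:   older than max_age
    - trivial:    fewer than min_words words
    - irrelevant: contains none of the keywords
    """
    keywords = [k.lower() for k in keywords]
    seen = set()
    for msg in messages:
        # Normalize case and whitespace so near-identical copies match.
        normalized = " ".join(msg.text.lower().split())
        if normalized in seen:                          # redundant
            continue
        if now - msg.created > max_age:                 # obsolete
            continue
        if len(normalized.split()) < min_words:         # trivial
            continue
        if not any(k in normalized for k in keywords):  # irrelevant
            continue
        seen.add(normalized)
        yield msg
```

Calling `list(capture_filter(messages, keywords=["acme"], now=datetime.now()))` would then store only recent, non-duplicate messages that mention the (made-up) brand name. Because the filter is a generator, it can sit directly in the incoming stream, so discarded data never reaches storage.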

In Capgemini’s study “Big & Fast Data: The Rise of Insight-Driven Business”, 33% of respondents mention the high cost of storing and manipulating large data sets as a barrier to implementing Big Data solutions. Selecting the relevant data before storing it will certainly reduce the cost of storage.

Archival backlog

Storing large amounts of data has always been the domain of archives. Not so long ago, information was stored on paper and kept in books and folios, organized and kept in archives. This still happens. And still archivists are struggling with the capture of the “old” formats of data of the last century: books, paper documents, electronic files and emails. We’re still discussing how to store them in an orderly manner in an electronic archive. But the world moves on.

Quite a lot of archives suffer from what they call a backlog. These archives are not equipped to process the increasing amounts of incoming documents. This backlog worsens when more data sources have to be archived: the data sources we call Big Data. This problem, which is now a topic of discussion among archivists, may hit you too. You might think: this has nothing to do with my organization. But when you aren’t prepared for the major influx of data, you may suffer a backlog too. When even the specialists aren’t coping, how can you be sure you can?

But where backlogs are a serious operational problem for archives, they can have more repercussions for your business. When you don’t have your data available in time due to a backlog, you might miss important upcoming business threats and opportunities. That’s a backlog you can no longer eliminate.

In my next blog I’ll describe some of the advanced solutions and techniques available to help select and classify unstructured content automatically, and how machine learning and Artificial Intelligence (AI) will help you cope with the large flood of data.

Photo CC BY-SA 2.0 by ami photography via Flickr