“Time is made up of captured moments and things shared…” (unknown)

In the previous blog I discussed why selecting data is essential to keep your datasets governable and healthy. Selecting data starts at the gate: the capture process.

When you aren’t prepared for the major influx of data, you might miss important upcoming business threads and opportunities. That’s a backlog you can never eliminate. This blog post describes the advanced solutions and techniques available to help select and classify unstructured content automatically, allowing you to cope with the large flood of data.

In Capgemini’s recent study, conducted in collaboration with EMC, “Big & Fast Data: The Rise of Insight-Driven Business”, one of the principles central to making the necessary digital transformation is: “Enable your data landscape for the flood from connected people and connected things. There are many new technologies that enable the capture and management of the data flood.” Automatic classification is the way to control this flood of incoming, unstructured data.

Automatic classification to the rescue

Major vendors of data and document capture tools have acknowledged that manual selection and classification of incoming data is impossible: the volume is simply too large. So systems have been developed to classify data automatically. These tools are in fact essential in a Big Data environment. They can process large amounts of data, using rules and Artificial Intelligence (AI) to determine, evaluate and classify the content so it can be stored in the right place, under the appropriate information governance policies. As a bonus, this also helps impose the correct privacy policies on the data.
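To make the rules-based half of this concrete, here is a minimal sketch of how a rule could tie keyword matches to a document class and a governance policy. The rules, classes and retention periods are invented for illustration; real capture tools ship far richer rule engines.

```python
# Hypothetical rule-based classification: each rule maps trigger keywords
# to a document class and a retention period (the governance policy).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    keywords: set          # terms that trigger this rule
    doc_class: str         # class assigned to the document
    retention_years: int   # governance policy applied on storage

RULES = [
    Rule({"invoice", "payment", "iban"}, "finance", 7),
    Rule({"contract", "signature", "clause"}, "legal", 10),
    Rule({"newsletter", "unsubscribe"}, "marketing", 1),
]

def classify(text: str) -> Optional[Rule]:
    """Return the rule with the most keyword hits, or None if none match."""
    tokens = set(text.lower().split())
    best = max(RULES, key=lambda r: len(r.keywords & tokens))
    return best if best.keywords & tokens else None

rule = classify("please find the invoice and payment details attached")
```

A document matching the first rule would be routed to the finance repository and kept for seven years; a document matching no rule falls through for manual review.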

Classification can be used in several ways. The one we need most is so-called content-based classification: classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. Modern tools, however, don’t need to be set up extensively beforehand; they are capable of machine learning. Based on what a human operator shows the software, it can derive the classification rules automatically. After the learning phase, the software can start classifying on its own, asking for human input only in exceptional cases. When you use classification schemes from outside, for example schemes used in your industry, the learning cycle can be shortened dramatically.
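The learn-then-classify loop can be sketched in a few lines. This toy classifier (pure standard library, illustrative class names and training texts) learns word frequencies per class from operator-labelled examples, then classifies on its own, deferring to a human only when the top two scores are too close to call:

```python
# Toy content-based classifier with a learning phase and a human fallback.
from collections import Counter, defaultdict

class LearnedClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word frequencies

    def learn(self, text, label):
        """Learning phase: a human operator labels example documents."""
        self.word_counts[label].update(text.lower().split())

    def classify(self, text, min_margin=1):
        """Score each class by weighted word overlap; defer to a human
        when the top two scores are too close (the exceptional cases)."""
        words = text.lower().split()
        scores = {label: sum(counts[w] for w in words)
                  for label, counts in self.word_counts.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        if len(ranked) > 1 and scores[ranked[0]] - scores[ranked[1]] < min_margin:
            return "needs-human-review"
        return ranked[0]

clf = LearnedClassifier()
clf.learn("invoice amount due payment bank transfer", "finance")
clf.learn("meeting agenda minutes attendees actions", "minutes")
label = clf.classify("please settle the invoice by bank transfer")
```

Production tools use far more sophisticated models, but the shape is the same: examples in, rules out, and a confidence threshold deciding when a person still needs to look.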

With the rise of cognitive computing, content filtering and classification become even more intelligent. Deep learning platforms enable you to build cognitive-infused applications with advanced data analysis capabilities such as taxonomy categorization, entity and keyword extraction, and sentiment analysis.
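As a flavour of two of the analyses named above, here is an illustrative sketch of frequency-based keyword extraction and lexicon sentiment scoring. The stopword and sentiment lists are made-up stand-ins for the trained models a real cognitive platform would provide:

```python
# Illustrative keyword extraction and lexicon-based sentiment analysis.
from collections import Counter

STOPWORDS = {"the", "a", "is", "was", "and", "to", "of", "very"}
POSITIVE = {"great", "helpful", "excellent", "happy"}
NEGATIVE = {"slow", "broken", "unhappy", "poor"}

def keywords(text, top_n=3):
    """Most frequent non-stopword terms, a crude keyword extractor."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

def sentiment(text):
    """Net count of positive vs. negative lexicon hits."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

review = "the support agent was very helpful and the service is great"
```

Run over a customer review, these would surface "support" and "service" as topics and tag the text as positive, the kind of signal that feeds the classification and routing described earlier.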

Cognitive computing can also help you tap into new data sources beyond written text. Voice recognition can create automatic transcripts of verbal conversations, for instance the calls your service employees have with your customers; these transcripts can then feed further text analysis. Image recognition, in turn, allows analysis of the ever-growing number of photos people take and share.

Analyzing Big Data for business purposes is one thing; using it in your business processes and decision making is another. But without appropriate, fit-for-purpose data, any analysis will miss its goal. Tapping into data sources and selecting and classifying the data is essential for obtaining good-quality datasets for analysis. This issue is most prominent with unstructured data, which is now becoming the world’s largest data source. Modern tools using machine learning and cognitive computing, however, will help you manage the large influx of data in your Big Data environment.

Photo CC BY-SA 2.0 by ami photography via Flickr