Capping IT Off

Opinions expressed on this blog reflect the writer’s views and not the position of the Capgemini Group

Content analytics to the max!

Words have weight, something once said cannot be unsaid. Meaning is like a stone dropped into a pool; the ripples will spread and you cannot know what bank they wash against (Philippa Gregory)

[Infographic: IBM Bluemix]

A couple of days ago, I spoke with a colleague about BI analytics and unstructured data. He said that you can’t analyze texts beyond counting words and letters, so they yield no meaningful business information. This statement rather shocked me, and it prompted me to write this blog, because I want to tell you the contrary: you can get meaningful statistics from unstructured data.

Of course, establishing meaning from structured data is easier than getting it from unstructured data, such as texts in documents, email messages, Twitter feeds, blogs, etc. When you have some form of Master Data Management in place, you know what the rows and columns in database tables mean and to which object or attribute in the real world they refer.

The meaning of texts depends more on context. In the real world, words can mean more than one thing: there are multiple languages, homonyms and synonyms, jargon, humor, subtleties, nuances, poetry, under- and overstatements. So how can you derive meaning from texts, and why would you want to do that anyway? In a Business Information context, that is.

A lot of information about you and your company is stored in free-format texts. The outside world still communicates with you through texts: documents, emails, faxes and so on. And those are only the messages in writing. Normally, people in your company read those messages, interpret them and take the appropriate actions. Sometimes the message is converted into structured data, as in claim processing, complaint handling and marketing communication.

But lots and lots of information about your organization isn’t put into databases at all. Reports that are written by or about you, legal documents, press articles, inquiries and communication: the list is endless. These are all texts that are addressed directly to you. But nowadays people in the outside world also write about your company: the quality of the products or services you sell, the pricing of your products, the delivery, the promises you’ve made. This communication about you isn’t addressed directly to you, but it still determines how people think about you and whether (and how) they want to do business with you.

In a modern competitive society this information is invaluable. But it is not stored in your databases and data warehouses. It floats around inside and outside your organization. Why not tap into these sources and try to derive useful business information from them?

Can this be done? Sure. Is it useful? Absolutely. In this globally connected, fast-paced world, trust, reliability, image and sentiment come and go at high speed. Do you want to be surprised by a sudden downturn in the market? Don’t you want to know when news about you and your company goes viral? A small incident can become huge because everybody talks and writes about it. And don’t you want to know what your customers and prospects really think about your company? You can interview them, of course, but do these interviews really give you the insights you need?

Now you’re interested in how this can be done, so let me explain how it works. Text search has been around for some time. Maybe you found this blog by searching for it with an internet search engine. Indexing large amounts of text isn’t a problem. You probably have fewer documents to search through than Google does. Asking the right questions in order to find the right documents can be tricky, but full-text search engines have an elaborate query language, with things like synonyms, stemming and sounds-like matching, so the chances that you’ll find the right documents are pretty high.
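To make the idea of indexing and querying a bit more concrete, here is a minimal sketch in Python. It is not how any particular search engine works: the stemmer is deliberately crude, the synonym table is made up, and the names `SYNONYMS`, `stem`, `build_index` and `search` are invented for this illustration. Sounds-like matching is left out.

```python
import re
from collections import defaultdict

# Hypothetical synonym table; a real engine ships far richer dictionaries.
SYNONYMS = {"cost": {"price", "pricing"}}

def stem(word):
    """Very crude stemmer: strip one common English suffix, if any."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokens(text):
    """Lowercase the text, split it into words and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokens(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents matching every query word or one of its synonyms."""
    result = None
    for word in re.findall(r"[a-z]+", query.lower()):
        variants = {word} | SYNONYMS.get(word, set())
        hits = set()
        for variant in variants:
            hits |= index.get(stem(variant), set())
        result = hits if result is None else result & hits
    return result or set()

# Made-up sample documents.
DOCS = {
    1: "Customers complain about rising prices",
    2: "Delivery was quick and the pricing is fair",
    3: "The new product ships next week",
}
INDEX = build_index(DOCS)
```

With this sample, `search(INDEX, "cost")` finds documents 1 and 2, even though neither contains the word "cost": the synonym table and the stemmer bridge the gap.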

When we’re able to search and find the right documents, creating statistics about these documents is the next step. That is, creating statistics about the contents of the texts. These can be simple queries like: How often am I mentioned in this category of news items? Or more advanced ones like: What do my users think about the quality of those products?
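The simple query, how often a name is mentioned per category, can be sketched in a few lines of Python. The news items and the company name "Acme" are made up for this illustration, and `mentions_per_category` is a hypothetical helper, not part of any product.

```python
from collections import Counter

# Made-up sample; in practice these items would come out of a search index.
NEWS_ITEMS = [
    {"category": "finance", "text": "Acme posts strong quarter. Acme shares rise."},
    {"category": "finance", "text": "Markets were flat today."},
    {"category": "technology", "text": "Review: the new Acme phone."},
]

def mentions_per_category(items, name):
    """Count case-insensitive occurrences of `name` in each category."""
    counts = Counter()
    needle = name.lower()
    for item in items:
        counts[item["category"]] += item["text"].lower().count(needle)
    return counts

COUNTS = mentions_per_category(NEWS_ITEMS, "Acme")
```

Note that this counts plain substrings, so a name that also occurs inside other words would be overcounted. The more advanced question, what users think about quality, needs more than counting.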

But as the questions get more complex, more meaning comes into play. So, as my colleague would say, how do you know the meaning of these words and sentences? How can you derive statistics based on the meaning? Here the new, advanced text analysis tools come to the rescue. By feeding the analysis tools with enough (sample) documents, the system can learn meaning. With a little help from humans, of course. And humans will still have to interpret the findings of the analysis, but what’s new in that?
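As a rough illustration of learning from sample documents, here is a tiny naive Bayes classifier in Python. This is a textbook technique chosen for the sketch; it is not a description of how Watson or any other product works. The labeled samples are invented, and the "little help from humans" is exactly those labels.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def train(samples):
    """samples: list of (text, label). Returns per-label word counts and document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words(text))
    return word_counts, doc_counts

def classify(model, text):
    """Pick the label with the highest naive Bayes log-score (add-one smoothing)."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for w in words(text):
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training samples, labeled by a human.
SAMPLES = [
    ("great quality, fast delivery", "positive"),
    ("love the product, great service", "positive"),
    ("poor quality and late delivery", "negative"),
    ("terrible service, product broke", "negative"),
]
MODEL = train(SAMPLES)
```

Calling `classify(MODEL, "great product, love it")` returns `"positive"` for this sample. With only four training texts the result says little; the point is that the labels come from people and the word statistics come from the documents.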

IBM Watson, an advanced text analytics tool, can analyze large quantities of text: petabytes. For instance, IBM and Twitter are making all tweets available for analysis. That’s a lot of text, and it probably contains tweets about you. Other data sources can also be accessed for analysis: internal document repositories, public information, news feeds and so on. It’s all there for you to analyze and act upon.

Isn’t Watson expensive? Not anymore. Watson is available in the cloud-based IBM Bluemix environment (currently in Beta), and it is free for trials. Powerful tools for text analysis and statistics (SPSS) are available for everybody, small and large. Give it a try and discover the possibilities.

Unstructured data, like documents and other texts, is no longer a closed, inaccessible pool of data. High-performing analysis tools can now delve into these large quantities of data and derive useful statistics about their contents. I’ve talked mainly about marketing analysis, but groundbreaking text analysis has also been done in the medical field, manufacturing and quality control, legal, and compliance and governance. And I didn’t even mention the possibility of starting a question-and-answer knowledge base. But from this moment on, there are really no obstacles left to stop you from exploring the vast domain of unstructured data.

Screenshot copyright IBM via IBM developerWorks.

About the author

Reinoud Kaasschieter
