There’s more information on the Internet than one could take in over a hundred lifetimes, and it’s growing too, and (most of the time) kept up to date. Because that information is held by different organisations, places and networks, it is hard to bring it all together. So how do we make it homogeneous and uniformly accessible?
Can it be done? Should it be done?
Over the last decade we went from data to knowledge. The World Wide Web linked companies and consumers. This inspired the shyer organisations to build intranets, where only company people would find each other.
That was about linking data: the same things could be done in a new way.
Then web shops came. Forums. Peer-to-peer networks for sharing every legal and not-so-legal delight. All that came and became mainstream so fast that the inter-intra move didn’t even have time to “happen”.
That was about connecting, and information: more or less new things were done in new ways.
In the last few years, social networks conquered the digital earth: LinkedIn, Facebook, Twitter. This was an entirely new kind of behaviour, built upon the infrastructure already laid out (computers, networks and the people using them).
That was about sharing information, acquiring knowledge: entirely new means to an entirely new end.
Meanwhile, Wikipedia was born. An unprecedented source of information with more than 14,000,000 articles in more than 260 languages.
Monthly unique visitors for all of these: LinkedIn 15 million, Facebook 130 million, Twitter 55 million, Wikipedia 60 million.
That’s a lot of data, information and knowledge. And it’s all out there. Wait, where?
Yes, it’s all out there, pretty much. Google helps us find it, almost in real time. We’ve seen some struggles with Facebook, which treated its data as a walled garden, but it is slowly opening up too. With Google starting to index books, video and other content, all knowledge in the world is starting to become available online and in real time.
But it is scattered all over the place, in different forms, behind different doors: not uniform or homogeneous. It’s very diverse.
The Integration theme: overcoming diversity
Integrating applications, departments and companies has shown this same theme over the last decades: diversity in form, location and accessibility has to be overcome.
The European Parliament shows how that can be done: introduce an intermediate language (or two, in that case), support different communication channels, and facilitate-by-translation.
That works very well for all: the focus and attention remain on the “stars” themselves, the highly specialised participants. Just as the business shouldn’t be bothered with IT, they aren’t bothered by linguistic barriers and can just move in and out.
There’s a big precondition to all that though, which is that the semantics are agreed upon beforehand. In the European Parliament, changing semantics are somehow magically picked up by all parties involved. Now how does all this work in the World Wide Web?
The first WWW problem: different format
Structured versus unstructured versus semi-structured. HTML, text, .doc, .PDF, Facebook updates, Tweets, it’s all different. However, search engines make all of that transparent. After all, there are only so many syntaxes around. Of course, visuals like video and images are an entirely different topic, but even those are magically made searchable by Google.
The second WWW problem: different location
Is it on the web, or behind a company firewall? Does it need authorisation? Only what is openly available can be searched. And it doesn’t matter whether it is located north, east, south, west, or orbiting the earth. Search engines make all of that transparent too.
The third WWW problem: different languages, dialects and typos
It still takes too many rules to perfectly translate one language into another. English is widely present though, and there are as many typos and spelling errors made by native speakers as by foreigners. All that has to be taken into account as well. Luckily most search engines do. They suggest the correct spelling if you misspell a search term. They’ll even include results that contain the misspelling.
The real WWW problem: different semantics
The biggest problem is (changing) semantics. Wikipedia spends pages and pages on disambiguation, explaining the different meanings a single word or acronym can have. The word web, for instance, can have entirely different meanings in different contexts. Even if, across all different forms, locations and languages, you are looking for the word web, what is the context you want to place it in? Heck, you might not even know that yourself…
The best example of how vivid a semantic discussion can be is the initial discussion around E2.0 and Social Business Design.
The possible solution: autonomous tagging
Tagging is a way of labelling a piece of information with a single word or phrase. Tags are decided upon individually by humans, in relative isolation. There is no central, global tagging system where one can pick their tags from. Yet tags have now also become a form of language, or at least of communication. If information were to be tagged, these tags could be translated or related to each other, and form connections across all diversities.
What if there were a tag knowledgebase much like today’s Wikipedia, where tags are maintained, explained, and so on? This would be the ultimate source of metadata, making a Single Source of Search possible. Its interface could be defined and plugged into, and it would be the single source of truth for the Semantic Web.
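To make the idea a bit more tangible, here is a minimal sketch in Python of what such a tag knowledgebase could look like. Everything in it is an assumption made for illustration: the names `TagEntry` and `TagKnowledgebase`, the canonical tag ids, and the sample aliases. It only shows the core mechanism: free-form labels, in any language or spelling, are related to maintained entries, and an ambiguous label such as web returns several candidates instead of a guess.

```python
from dataclasses import dataclass, field


@dataclass
class TagEntry:
    """One maintained entry in a hypothetical shared tag knowledgebase."""
    canonical: str                      # stable id, e.g. "web-www"
    description: str                    # human-written explanation of the tag
    aliases: set[str] = field(default_factory=set)  # synonyms and translations


class TagKnowledgebase:
    """Relates free-form labels to maintained tag entries."""

    def __init__(self) -> None:
        self._entries: dict[str, TagEntry] = {}
        self._index: dict[str, set[str]] = {}  # lowercased label -> canonical ids

    def add(self, entry: TagEntry) -> None:
        self._entries[entry.canonical] = entry
        for label in {entry.canonical, *entry.aliases}:
            self._index.setdefault(label.casefold(), set()).add(entry.canonical)

    def resolve(self, label: str) -> list[TagEntry]:
        """Return every entry a label could refer to (empty if unknown)."""
        ids = self._index.get(label.casefold(), set())
        return [self._entries[i] for i in sorted(ids)]


kb = TagKnowledgebase()
kb.add(TagEntry("web-www", "The World Wide Web", {"web", "WWW", "world wide web"}))
kb.add(TagEntry("web-spider", "A web spun by a spider", {"web", "cobweb", "spinnenweb"}))

# "Web" is ambiguous: both entries come back, so the context still has to decide.
for entry in kb.resolve("Web"):
    print(entry.canonical, "-", entry.description)
```

A label in another language (here the Dutch spinnenweb) resolves to a single entry, while web resolves to two. That second case is exactly the semantics problem described above: the knowledgebase can list the candidates, but choosing between them still needs context.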
Bots could crawl the entire Web tagging information, whether it’s HTML, PDF, video, images or anything else.
In my last post I wrote about the position and quality of humans versus machines. This very complex and dynamic terrain is definitely something that needs to be explored and maintained by humans first. When that’s successful, we might be able to automate that, and skip to the next level: wisdom.
Thanks to Paolo Saitti for asking a few nasty questions and giving a few nasty answers at the same time; here they are:
Tagging, however, will require a certain semantic and cultural alignment. The level of knowledge also dictates how detailed a search can be. SQL Server is too generic if you want to find something about SQL Server 2008 Service Broker – but you need that level of knowledge to be able to know the difference!
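One way to picture that difference in detail is a chain of broader and narrower tags. The sketch below is only an illustration in Python: the tag names and the `BROADER` mapping are made up for this example, and a real tagging dictionary would have to be curated by people.

```python
# Hypothetical "narrower tag -> broader tag" links.
# Assumed to contain no cycles; a curated dictionary would have to guarantee that.
BROADER = {
    "sql-server-2008-service-broker": "sql-server-2008",
    "sql-server-2008": "sql-server",
    "sql-server": "database",
}


def broader_chain(tag: str) -> list[str]:
    """Return the tag followed by each progressively broader tag."""
    chain = [tag]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def matches(document_tag: str, query_tag: str) -> bool:
    """A document matches when its tag is the query tag or narrower than it."""
    return query_tag in broader_chain(document_tag)


print(broader_chain("sql-server-2008-service-broker"))
print(matches("sql-server-2008-service-broker", "sql-server"))
print(matches("sql-server", "sql-server-2008-service-broker"))
```

Searching on the generic tag finds the Service Broker document, along with everything else about SQL Server. Searching on the specific tag does not return the generic material, but you have to know that the specific tag exists before you can ask for it.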
At the basis, agreed semantics are needed. Human beings simply call it a dictionary. The traditional dictionary experience is quite enlightening: new words spring up from informal usage, often from jargon. This is the dynamic and “democratic” evolution of natural languages. As they become (very) common, the compilers of the dictionaries decide to include the new terms in the “official” language. Here a higher authority is required to provide a formal, unambiguous definition for the new word.
Distributed, informal, continuously evolving tagging (a bottom-up process) is enough for human interactions. On the other hand, we’ll need a formal, robust and agreed tagging dictionary, with a consistent effort to develop and maintain it (a top-down process), in order to build semantic applications that exploit content on the web.

Martijn Linssen is Enterprise Integration Architect within Capgemini. You can find him on Twitter. Paolo Saitti is Enterprise Software Engineer within Capgemini. You can find him on Twitter