Unlocking the business value of Open Data with Data Science

Mention open data to a CEO, CIO, or any other C-level executive, and you are likely to be challenged with the question, where’s the business value? To be fair, they have a point. While there has been a great deal of investment in the creation of open data, particularly within the public sector, evidence of its successful exploitation by business has been far less obvious. However, when a global consultancy like Capgemini Insights and Data routinely explores the potential for open data sources to add value for major corporate and government projects, there is clearly something in this space that C-level executives and their businesses need to be aware of.

This was in evidence at the recent Open Data Innovation Summit in London, with presentations from international organisations such as GlaxoSmithKline and the United Nations, and UK bodies like Defra, Environment Agency, Food Standards Agency, Land Registry, Met Office and NHS Digital, as well as research institutions like the University of Bristol and the European Space Agency. This article was co-written with my fellow Data Scientist colleague, Andy Challis, in response to our experiences at the Summit and to the increasingly important profile of open data in our consultancy work at Capgemini.

Open data is data that anyone can access, share or use. By definition, open data is freely available to every business and so would seem to have little inherent competitive advantage for one business over another. Likewise, the concept of a business publishing its treasured data assets as freely available open data appears to be completely at odds with the intuitive advantage of maintaining exclusive exploitation rights. Why then were such a wide range of organisations reporting on their open data activities at the Summit? The not-for-profit organisations could perhaps be explained away by their ‘public good’ mission but that would not explain the significant commercial representation at the Summit – as both consumers and, perhaps more surprisingly, producers of open data. To answer this question, we review a selection of presentations that illustrate business value opportunities for open data and the key role that data science plays in realising that value. In particular, we highlight how machine learning and AI are being applied to the following three types of open data to enhance existing private ‘closed data’ for quantifiable business benefit.

  • Geospatial – mapping, boundaries, locations, routing
  • Environmental – weather, air quality, and other sensor data
  • Behavioural – mobile tracking, ticketing, webcams, cctv

We have included a few examples of these open data types at the end of this article. In the next section we report on four case studies, presentations from the Summit, that illustrate different approaches to unlocking business value from open data. Finally, as far as commercial sensitivities allow, we sketch some of the exciting ways that Capgemini’s data scientists are helping clients to go from open data ideas to real business value.

Case Studies

Below we discuss presentations from Transport for London (Case study 1), The World Bank (Case study 2) and Experian (Case study 3). As an interlude, we then include a brief overview of the open data ecosystem which serves as a prelude to the holistic vision for open data advanced in the data.world presentation (Case study 4).

Case study 1: Transport for London

Transport for London publishes over 200 types of open data through its TfL API in pursuit of its mission to, “Keep London moving, working and growing and make life in the city better”. Rikesh Shah, TfL Digital Manager, cites an estimated £30m-£116m annual benefit to TfL from its open data activities – a figure which has doubled over the last year. Other reported benefits to TfL include improvements in transparency, reach, niche products and innovation.

With an ecosystem of over 12,000 registered developers and 600 apps built using their open data API, TfL also derives benefit from an improved overall customer experience which in turn improves trust and loyalty. Boasting partnerships with Apple, Twitter and Google, TfL is actively promoting the consumption of its open data and reaching beyond its traditional website or app user base. Third-party apps further extend this reach and, for companies like Citymapper with their popular eponymous app, their business model is entirely enabled by the availability of TfL open data. While Citymapper is probably the best-known use case, overall, TfL estimates that around 1,000 jobs are enabled by their open data ecosystem.

Citymapper itself is an interesting use case because they have used data science to identify a new business opportunity. In their words, “First we built an app to help you get around town, using open data. … We also built tools to analyse the data and learned a lot about how people are moving around. When we studied the existing public transit routes, we realised that they don’t always serve people best, nor evolve quickly enough to accommodate changes in the city.” Based on this analysis, Citymapper launched their innovative Smartbus service to meet a need predicted by the data but currently unmet by traditional bus services. Coming full circle, Smartbus will be publishing its own open data after the TfL model.

Case study 2: The World Bank

The World Bank presentation approached open data from a different direction: risk assessment. Edward Anderson, Senior ICT and DRM Specialist at the Bank, described the gulf between theory and practice when trying to assess natural disaster risks around the world, in an attempt to identify and mitigate for events like the New Orleans 2005 hurricane flooding or the Kathmandu 2015 earthquake. Anderson stressed that while powerful mapping and analytical tools were available for hazard mapping, exposure mapping and vulnerability assessment, they all assume the existence of high-quality data (geodata) describing the locations of interest. Unfortunately, much of the world’s population live in the poorest areas of cities for which little or no detailed geodata exists.

Traditional approaches to geodata acquisition are complex and expensive, and the resultant data rapidly becomes outdated for areas with uncontrolled urban growth. So, the Bank enhanced its own data and predictive models with geodata from two of the largest and best-known open data sources, OpenStreetMap and Wikipedia. For the poorest neighbourhoods though, even these sizeable resources have gaps in their coverage. However, Anderson described the innovative 2016 Ramani Huria Open Map community project in Dar es Salaam that has crowdsourced the missing geodata for their locale by tracing open satellite imagery to produce geodata which they then annotate with metadata describing each entity. This process makes use of OpenStreetMap’s built-in tools and they elected to contribute the resultant geodata back into OpenStreetMap, filling in previously blank spaces on the local map. Approached initially as a pragmatic cost-saving solution, the initiative has seen local uptake, usage and curation of this open data, which now describes 160,000 buildings and over 500km of waterways. Using this crowdsourced open data the Bank has created an effective scenario planning tool to pro-actively plan risk mitigation, delivering real ‘business value’ to the Bank and the community.

Case study 3: Experian

Experian is a data business and so it might come as little surprise to find that they are avid consumers of open data. In his presentation, Paul Malyon, Data Strategy Manager at Experian, reported on their work to score the quality of valuable datasets for their business, specifically: Companies House Register, Land Registry Price Paid and NHS Choices GP Practices and Surgeries. Malyon summarised the data quality issues they had encountered and the remedial steps that Experian take to address these prior to data analysis and producing focused reports for their customers. He pointed out that the business consequences of poor quality data for Experian’s customers include exposure to ID fraud risk, inability to access financial services, failure to discover relevant properties through legal searches, errors in reported National Statistics, and increased A&E traffic at hospitals.

For a quality scoring framework, Experian worked closely with the Open Data Institute (ODI) to develop a “Kite Mark” standard for open data quality. Using Experian’s bespoke Pandora data quality tool, they were able to identify important errors, such as over 88,000 duplicate property sales, around 22,000 sales with no post code and only 14% of countries recorded with standard FCO names. Recognising that it was in their own interest to feed this information back to the data publishers, Experian are helping to raise the quality of the data that their customers rely on. In summarising the next steps for open data, Malyon advocated that quality measures be published alongside data, which in turn requires data quality standards supported by standardised tools and registries.
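Checks of the kind described above, counting duplicate records and records with no post code, can be sketched in a few lines of Python. This is a minimal illustration and not Experian's Pandora tool: the field names (`price`, `date`, `postcode`, `address`) and the sample rows are invented for the example, and a duplicate is assumed to be a record that repeats all four fields after simple normalisation.

```python
from collections import Counter

# Invented sample of property-sale records (not real Land Registry data).
sales = [
    {"price": 250000, "date": "2016-03-01", "postcode": "AB1 2CD", "address": "1 Example Row"},
    {"price": 250000, "date": "2016-03-01", "postcode": "ab1 2cd", "address": "1 Example Row "},
    {"price": 180000, "date": "2016-04-12", "postcode": "", "address": "7 Sample Street"},
    {"price": 320000, "date": "2016-05-20", "postcode": "EF3 4GH", "address": "2 Test Road"},
]


def quality_report(records):
    """Count duplicate records and records with no post code."""
    # Normalise case and whitespace so trivially different copies still collide.
    keys = Counter(
        (r["price"], r["date"], r["postcode"].strip().upper(), r["address"].strip().lower())
        for r in records
    )
    # Every occurrence beyond the first of a given key is counted as a duplicate.
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    missing_postcode = sum(1 for r in records if not r["postcode"].strip())
    return {
        "total": len(records),
        "duplicates": duplicates,
        "missing_postcode": missing_postcode,
    }


report = quality_report(sales)
print(report)
```

Publishing a summary like this alongside a dataset is one simple way to expose quality measures to its consumers, although a real scoring framework would cover many more rules than these two.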

Interlude: the open data ecosystem

There have been concerted efforts in the past to introduce standardisation into open data, the most notable of which was Sir Tim Berners-Lee’s idea of a ‘web of data’, the Semantic Web, and the Linked Open Data (LOD) movement which together produced a rich resource of interconnected open datasets governed by W3C standards for representing and querying this data. Examples of datasets linked into the ‘LOD cloud’ (aka ‘LOD graph’) are Wikipedia, WordNet, Geonames and Flickr. Technologically, the growth in LOD went hand-in-hand with the rise in popularity of graph databases, like Jena and Neo4j, that are now part of the standard NoSQL toolkit for software engineers and data scientists alike. Graph support has since made its way into traditional relational databases like Oracle as well as distributed ‘big data’ file stores, data warehouses and data lakes.

Figure: Linked Open Data Cloud

However, the drive to open up data and the relative complexity of publishing data as LOD has resulted in most open data being published ‘unlinked’ through repositories or as metadata embedded in traditional web pages. Publicly-funded organisations and regional government are now encouraged to open up “their” data as a public good. For example, funders of academic research and major journals now demand the publication of data underpinning published research, consequently generating unprecedented volumes of open data from every discipline imaginable – a freely accessible resource that remains largely untapped outside of academia. The resultant repositories include the various data.gov websites around the world (e.g. data.gov in the US and data.gov.uk in the UK), university research data repositories (e.g. data.bris.ac.uk) and various domain-specific archives. Open data is also shared less formally through a multitude of cloud archives such as The Internet Archive (archive.org), Figshare, GitHub, PubChem, PubMed, Slideshare and, of course, social media.

Repositories, whether formal or informal, include metadata about the data they contain but, unlike LOD, provide no standardised way to query the data (although many offer bespoke REST APIs for programmatic access). Open data as metadata embedded in HTML is dominated by schema.org, a joint initiative between the leading search engines to simplify the publication of machine-readable metadata, primarily about products, services and prices. Whereas LOD makes querying open data fairly straightforward, data published in schema.org format requires software consuming the data to crawl websites and aggregate (‘harvest’) the data themselves.
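To make the harvesting step concrete, the sketch below pulls schema.org metadata out of a single web page using only the Python standard library. It assumes the page embeds its metadata as JSON-LD inside a `<script type="application/ld+json">` element, which is one of the encodings schema.org supports; pages using Microdata or RDFa would need a different parser. The page fragment and the `JsonLdExtractor` class are invented for illustration, and a real harvester would also have to fetch and crawl pages, which is not shown.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Invented page fragment standing in for a crawled product page.
page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Widget",
 "offers": {"@type": "Offer", "price": "9.99", "priceCurrency": "GBP"}}
</script>
</head><body><p>Example Widget</p></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(page)

# Keep only the product descriptions found on the page.
products = [item for item in extractor.items if item.get("@type") == "Product"]
for product in products:
    print(product["name"], product["offers"]["price"], product["offers"]["priceCurrency"])
```

Aggregating the `products` lists from many crawled pages is what turns embedded metadata into a usable dataset, which is the extra work that LOD query endpoints avoid.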

Case study 4: data.world

data.world addresses a key problem with the increasing volume and variety of open data now available: finding exactly the right dataset to add value in a specific business setting, without a priori knowledge of what that dataset is called and/or who published it. In his presentation, Matt Laessig, co-founder of data.world, set out the corporation’s bold ambition to build a searchable catalogue of all open data, linked or unlinked. He noted that traditional search engines like Google are optimised for textual search and offer little support for finding open data or its associated metadata and licences. Filling this gap in the market, data.world aims to become the Google of open data, not through some new search technology but instead by crowdsourcing the catalogue through its rapidly growing “social network for data people”. Although still in beta, data.world already catalogues an eclectic range of open data resources for agriculture, education, finance, government, health, public safety and weather – ranging from the niche, “Restaurants That Sell Burritos and Tacos in the U.S.”, deep-dive “WikiHow’s Human Instructions: 200,000+ formalised step-by-step instructions in English”, to daily feeds like the “Global Economic Monitor” and “Cruise Ship Locations”.

From Open Data ideas to real business value

Having identified a candidate open dataset to address a particular business need, the next step is almost always to find ways of linking that data to ‘closed data’ that the business already holds. In an ideal world, there would be a unique identifier (a key) in each dataset to be joined, but this is often not the case even within a single business’s data holdings, let alone between arbitrary datasets. To solve this problem, Capgemini data scientists develop automated matching tools for clients and, when working at the scale of millions or billions of records, deploy tools like IBM Big Match to intelligently join using a mix of off-the-shelf and custom matching algorithms.
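As a rough illustration of joining two datasets that share no key, the sketch below matches organisation names by string similarity using Python's standard `difflib` module. It is not the matching logic of IBM Big Match or of any Capgemini tool: the `normalise` and `best_match` helpers, the suffix list, the 0.85 threshold and the sample names are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Company suffixes ignored when comparing names (an assumed, incomplete list).
SUFFIXES = {"ltd", "limited", "plc"}


def normalise(name):
    """Lower-case, strip punctuation and drop common company suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(token for token in cleaned.split() if token not in SUFFIXES)


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate, or (None, score) if below threshold."""
    target = normalise(name)
    scored = [
        (candidate, SequenceMatcher(None, target, normalise(candidate)).ratio())
        for candidate in candidates
    ]
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else (None, score)


# Invented names: 'closed' records held by a business and an 'open' register.
closed_names = ["Acme Widgets Ltd.", "Riverside Water Works", "Northern Pipes PLC"]
open_names = ["ACME WIDGETS LIMITED", "Riverside Waterworks", "Southern Cables Limited"]

for name in closed_names:
    match, score = best_match(name, open_names)
    print(f"{name!r} -> {match!r} (similarity {score:.2f})")
```

Comparing every record with every candidate does not scale to millions of records, which is one reason dedicated matching tools are used at that size. The threshold also trades missed matches against false ones and would need tuning on real data.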

Moving from the abstract to specific real-world examples, Capgemini Insights and Data recently completed data science projects using a variety of open data sources for two utility companies. In the first example, we used live traffic camera open data feeds from Transport for London to develop a dashboard to aid the company in managing its engineers. If a fault is detected in one of their regions, the dashboard can help them narrow down the issue: to look for any building work going on nearby, to see what the live environment is like and to check whether their vans can get access. The team also gave the client options for future work using image classification to give a last known whereabouts of their workforce and assets.

In the second example, we leveraged open data from the Met Office to enable a water utilities company to predict the demand on the network at the reservoir level. This enabled the simulation of scenarios for network operations (for example, flushing a reservoir or shutting off pipes for certain periods) and helped to identify the cheapest time to carry these out, based on the cost of electricity at that time of day.

Both of these Insights and Data projects used OpenStreetMap data for visualisations which we enhanced with the clients’ private data (e.g. regional polygons, pipelines and points for meters and leaks). However, we also undertook further work to enhance their data, for example extracting extra features using the polygon regions combined with satellite images from Google Maps. This could then be used to model the demand of the regions – for instance, if one area has more agriculture then it is more likely to draw larger amounts from the network when it has been hot and has not rained for several days.
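One simple example of a feature derived from regional polygons is a count of the points (such as leaks) that fall inside each region. The sketch below uses a plain ray-casting test and is only an illustration: the region shapes, leak locations and names are invented, coordinates are treated as flat x/y values, and points lying exactly on a boundary are not handled specially. A real project would more likely use projected coordinates and a GIS library.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon, given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point; an odd count means inside.
            if x < x_cross:
                inside = not inside
    return inside


# Invented regional polygons and leak locations in arbitrary planar units.
regions = {
    "region_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "region_b": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
leaks = [(1, 1), (3, 2), (6, 3), (9, 9)]

# Feature: number of leaks per region (points outside every region are ignored).
leak_counts = {name: 0 for name in regions}
for x, y in leaks:
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            leak_counts[name] += 1
            break

print(leak_counts)
```

Per-region counts like these can then sit alongside other regional features, such as land use derived from imagery, as inputs to a demand model.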

These innovative applications of data science, along with our choice examples from the Open Data Innovation Summit, concretely answer the C-level executive’s question of where the business value of open data lies. At the heart of these open data applications is the idea of enriching and integrating confidential ‘closed data’ through data science. Indeed, as the first author highlighted in his own presentation at the Summit, recent technological advances are now enabling secure, non-disclosive open data science on highly sensitive closed data – opening up exciting collaborative opportunities within and between businesses in highly regulated sectors like defence, health, insurance and pharma. Within Capgemini Insights and Data, we are continuously identifying such new opportunities for clients to gain actionable insights from the inclusion of open data in their advanced data science projects. If you would like to discuss open data possibilities for your business, in any sector, anywhere in the world, then please get in touch.

Appendix – example open data

To give a flavour of the variety of open data available, the table below lists a few examples of data sets. Many more are catalogued at data.world and in other open data repositories, registries and catalogues.

Data Set: List of names by gender
Covers data filtered from SSN names, US, UK and Indian Names.
(117,950 records)
Examples:
    name        gender  score
    Sinead      0       1
    Sinem       0       1
    Sinforosa   0       1
    Sing        1       0.88
    Singh       1       0

Data Set: Every place name in the US
Names geodata from the Geographic Names Information System (GNIS).
(335,680 records)
Examples:
    feature_name  state_alpha  prim_lat_dec  prim_long_dec
    Curry         AK           62.614        -150.011
    Lingo         AK           61.202        -149.920
    Lenoir        AR           34.748        -91.270
    Tampico       MT           48.304        -106.827
    Tattnall      MT           48.627        -107.542

Data Set: Newspaper and magazine images segmentation dataset
Training and validation of classification regions of documents on text, picture and background areas.
(101 images and ground truth masks)
Examples:
    Scanned images of various newspapers and magazines in Russian. For all images ground truth pixel-based masks were manually created. There are three classes: text area, picture area, background. Pixels on the mask with color 255, 0, 0 (rgb, red color) correspond to picture area, pixels with color 0, 0, 255 (rgb, blue color) correspond to text area, all other pixels correspond to background.

Data Set: UK food hygiene rating
Food hygiene rating or inspection result given to a business and reflect the standards of food hygiene found on the date of inspection or visit by the local authority. Businesses include restaurants, pubs, cafés, takeaways, hotels and other places consumers eat, as well as supermarkets and other food shops. Includes geolocation of every inspected business.
(551,370 xml records)
Examples:
    <AddressLine2>1 Rivergate</AddressLine2>
    <PostCode>BS1 6ED</PostCode>

Data Set: Alcohol consumption by country
World Health Organization, Alcohol consumption by country (2010).
(193 records)
Examples:
    country         beer_servings  spirit_servings  wine_servings
    Afghanistan     0              0                0
    Albania         89             132              54
    Algeria         25             0                14
    Andorra         245            138              312
    United Kingdom  219            126              195

About the authors

Simon Price and Andy Challis are data scientists in Big Data Analytics UK, Capgemini Insights and Data.
