We hear a lot about the impending shortage of data scientists. McKinsey started the fuss with their 2011 big data report, in which they stated that the US alone faces a shortage of between 140,000 and 190,000 people with deep analytical skills[1]. And we have heard the issue repeated many times since. But so far this has not proved to be the case. At Capgemini we have lots of big data projects and, don’t get me wrong, data scientists are key to the solution, but the biggest challenge is often the provenance of the source data. The old adage that “you only get out what you put in” is never truer than when you start to build analytics on big data.

Two factors are driving this. Firstly, we are using data from a lot of disparate sources, not just the traditional enterprise data, which, whilst not always completely reliable, had a level of control and audit. Secondly, we are using the results to drive hundreds of small variations and changes in our processes: the cross-sell, the system adjustments, identifying the marginally high-risk transactions. These are “finer adjustments” that need more accuracy, not less.
So, more than ever, we need to have the skills to understand the origins and validity of our source data. But this is a complex subject. Data sources need to be assessed in three dimensions: provenance, legality and sensitivity. Provenance is about where the data has come from. Do we trust that source, and what level of quality can we expect in the data? Inevitably many of our sources will be less than 100% reliable. We need to establish what the level actually is and adjust how we treat that data in our analytical models accordingly.
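One way to picture that adjustment is to attach a trust score to each source and weight records accordingly before they feed an analytical model. The sketch below is purely illustrative: the source names and trust scores are assumptions, not real figures from any project.

```python
# Illustrative sketch: down-weighting records by source reliability.
# Source names and trust scores below are hypothetical assumptions.

SOURCE_TRUST = {
    "enterprise_crm": 0.95,  # audited internal system
    "web_scrape": 0.60,      # uncontrolled external source
    "social_feed": 0.40,     # noisy, unverified
}

def weighted_average(records):
    """Average a metric across records, weighting each by its source's trust score."""
    total = sum(r["value"] * SOURCE_TRUST[r["source"]] for r in records)
    weight = sum(SOURCE_TRUST[r["source"]] for r in records)
    return total / weight if weight else None

records = [
    {"source": "enterprise_crm", "value": 100.0},
    {"source": "web_scrape", "value": 140.0},
    {"source": "social_feed", "value": 60.0},
]
print(weighted_average(records))
```

The point is not the arithmetic but the discipline: a source's reliability is assessed up front and then carried through into how the model treats its data, rather than every record being taken at face value.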
Legality seems fairly obvious on the face of it, but there are actually several layers to this too. Clearly we would like our data to come with full legal rights to use as we want. Many apps on your mobile will ask you to check a box saying you agree to their terms and conditions, but when you read the small print you find you have just signed over the right to let them sell your grandmother. They will often want rights to your full address book and to track you at any time. Most data does not come with this level of release; however, this doesn’t mean it is not usable. In many cases anonymous and/or aggregated data can still provide a lot of insight. But it will be important to understand what is and isn’t allowed, and this varies considerably between countries and even states.
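To make the aggregation point concrete, here is a minimal sketch of stripping identities and rolling records up to group counts so that insight survives while individual records do not. The field names and records are invented for illustration only.

```python
from collections import Counter

# Illustrative sketch: aggregating records so that names are discarded
# and only group-level counts remain. All field names are hypothetical.

raw = [
    {"name": "A. Smith", "postcode_area": "SW1", "purchased": True},
    {"name": "B. Jones", "postcode_area": "SW1", "purchased": True},
    {"name": "C. Brown", "postcode_area": "M1", "purchased": True},
]

def purchases_by_area(records):
    """Count purchases per postcode area; the name field never leaves this function."""
    counts = Counter()
    for r in records:
        if r["purchased"]:
            counts[r["postcode_area"]] += 1
    return dict(counts)

print(purchases_by_area(raw))  # area-level counts only, no names
```

A real anonymisation exercise would go further, for example suppressing areas with very few individuals so no one can be singled out, but the principle is the same: the analytical value is in the aggregate, not the identifiable record.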
Sensitivity, my last dimension, is not so obvious but can be crucially important. You may have the rights to use the data, but in doing so are you breaching some ethical boundaries? In these days when brand reputation and image can make or break a business, every company must take a view of what is reasonable and ethical, and think about the impact of people knowing how their data is being used. We only need to look at the recent NSA revelations to understand that something that is possible may not be seen by everyone else as ethical. Similarly, the example of Target, the US retailer, using analytics to identify pregnant women[2] crossed these boundaries.
Data forensics – understanding your data sources – will be the real skill in turning big data into value.

[1] McKinsey & Co – Big Data: The next frontier for innovation, competition, and productivity, May 2011