Insights & Data Blog

Opinions expressed on this blog reflect the writer’s views and not the position of the Capgemini Group

The conundrum of Big Data

As with many buzzwords, ‘Big Data’ is an ambiguous term, often used by professionals trying to seize a new market opportunity. Yet ‘Big Data’ signifies an important shift in the Information Revolution, and it is critical that we clarify both its domain and its principal challenge. Terms like ‘Big Data’, ‘Data Science’, and ‘Analytics’ are used interchangeably to characterise the information potential of sizeable datasets. This is inaccurate, though, as each term represents a distinct concept in the Analytics-as-a-Service (AaaS) ecology.

‘Big Data’ (BD) refers to the accumulation of data that cannot be processed or handled using traditional data management tools and processes. It is a data management architecture challenge characterised succinctly by the ‘four-V’ model (Volume: size of data, Variety: data in multiple forms, Velocity: data in motion, Veracity: data accuracy). Google revolutionised the IT approaches in this domain and popularised the importance of handling voluminous data as ‘information currency’ for competing in today’s markets; the term ‘Big Data’ has come to represent the simple yet seemingly revolutionary belief that data are valuable.

In this new competitive paradigm, the agents responsible for liberating and creating meaning out of raw [big] data are the Data Scientists (The Data Science Association, 2013). Contrary to popular belief, ‘Data Science’ was suggested as far back as 2001 as a rebranding of the field of ‘Data Mining’ to reflect advances in computing with data (‘Data Science: an action plan for expanding the technical areas of the field of statistics’, W.S. Cleveland). The following year saw the introduction of the Sarbanes-Oxley audit reforms (following the Enron scandal), which forced businesses to systematise their controls for financial reporting and invest in large-scale data processes to support performance planning.

Despite two decades of intensive IT investment in data [mining] applications, recent studies show that companies continue to have trouble identifying metrics that can predict and explain performance results and/or improve operations. Data mining, the process of identifying patterns and structures in the data, has clear potential to identify prescriptions for success but its wide implementation fails systematically. Companies tend to deploy ‘unsupervised-learning’ algorithms in pursuit of predictive metrics, but this automated [black box] approach results in linking multiple low-information metrics in theories that turn out to be improbably complex.
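The risk of linking many low-information metrics can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a description of any vendor's algorithm: every metric and the target are pure random noise, and the function name `best_noise_metric` and its parameters are hypothetical. It 'mines' a training window for the metric that correlates best with the target and then re-checks that same metric on fresh data.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_noise_metric(n_metrics=500, n_train=30, n_test=30, seed=0):
    """Pick the noise metric most correlated with a noise target, then re-check it.

    Returns (in_sample_correlation, out_of_sample_correlation).
    All data are synthetic: there is no real relationship to find.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]
    metrics = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_metrics)]

    # "Mine" the training window for the strongest-looking relationship.
    best = max(metrics, key=lambda m: abs(pearson(m[:n_train], target[:n_train])))

    in_sample = pearson(best[:n_train], target[:n_train])
    out_of_sample = pearson(best[n_train:], target[n_train:])
    return in_sample, out_of_sample


if __name__ == "__main__":
    for seed in range(5):
        r_in, r_out = best_noise_metric(seed=seed)
        print(f"seed={seed}  in-sample r={r_in:+.2f}  out-of-sample r={r_out:+.2f}")
```

Because the search is run over hundreds of candidates, the winning metric tends to look impressive in the window it was selected on and much weaker on data it has not seen, which is the pattern a hypothesis-free search is prone to.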

There is no better case in point than Google’s Flu Trends. Back in 2008 the Internet colossus launched a web service that claimed to trace the spread of influenza by finding correlations between web searches and whether people were exhibiting flu symptoms. The platform’s early success was credited to its theory-free, fast-processing predictions, which arrived a whole week ahead of those from the Centers for Disease Control and Prevention (CDC), whose approach relied on reports from medical check-ups. The search engine algorithms would become emblematic of a new data-processing paradigm that businesses could aspire to and compete on, since with ‘enough data the numbers speak for themselves’ (Anderson, 2008, Wired).

This trend was only set to grow exponentially with the recent rebranding of data mining as ‘Data Science’ in this era of Big Data. It is powered by a multi-billion dollar industry of software vendors that offer specialist tools for data discovery, data mining, and predictive analytics (Qlik Sense, Wave Analytics, SAS Visual Statistics, etc.). IT departments view this trend as an evolutionary step in the information ecology and act as catalysts in the adoption of such technologies, typically managing such programmes as just another MIS implementation.

But no matter how scalable and capable the implemented data technology, or how sophisticated the Artificial Intelligence (AI), the human factor is an essential prerequisite for the successful implementation of algorithms and data-analysis processes. It is simply impossible to identify data metrics high in information content and impact [on explanations and predictions] without hypothesis-driven learning. To put this in perspective, the totality of the world’s supercomputers and AI with access to unlimited data would be unable to replicate the fundamental structure of an economy, like the system of 100+ equations that comprise the UK macroeconomic model. Such models are the culmination of deep insight into the workings of a complex system, gained purely through data-hypothesis strategies prescribed by scientists such as statisticians, econometricians, and business analysts.

This was the very undoing of Flu Trends, as Google’s engineers cared only about correlation rather than causation in the data. Finding patterns was easier and cheaper, but a theory-free analysis of correlations is inevitably fragile and prone to collapse as the data are constantly refreshed. In Google’s case, flu scare stories and inappropriate linguistic processing increased the frequency of related searches to the point where the Flu Trends algorithms artificially inflated the predicted spread of the disease.

Hypothesis testing, the process of analysing data to explain and check the validity of specific ideas, is the essence of Analytics. It is an exercise in problem-solving using data and science, consistent with the ethos of the quote attributed to Albert Einstein: ‘if I had only one hour to save the world, I would spend 55 minutes defining the problem, and only 5 minutes finding the solution’. By extension, the offering of Analytics-as-a-Service reflects the provision of consulting services across the chain of Analytics-related activities (Understanding of the client problem → Data management and processing → Data-investigation techniques → Statistical analysis to quantify ‘uncertainty’ in the data → Model deployment → Implementation of solution). In this context, Big Data is just an auxiliary process in a series of analysis activities. The correct definition of a client problem might not even require ‘big’ data. Datasets will invariably ‘shrink’ in dimensions, because the analysis works on the subsets of variables that are statistically significant, and sampling techniques promote the use of compact yet generalisable datasets.
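To make the idea of testing a specific hypothesis on a compact sample concrete, here is a minimal sketch using a two-sided permutation test for a difference in means. Everything in it is an illustrative assumption: the data are synthetic, the scenario (a 'treated' and a 'control' customer group) is invented, and the names `permutation_test` and `demo` are hypothetical. It is one of many valid tests, chosen because it needs only the Python standard library.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value), where the difference is
    mean(group_a) - mean(group_b).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def demo(sample_size=200, seed=42):
    """Test a hypothesis on small random samples drawn from large synthetic groups."""
    rng = random.Random(seed)
    # Synthetic populations (assumed values, for illustration only).
    control = [rng.gauss(50, 10) for _ in range(20000)]
    treated = [rng.gauss(53, 10) for _ in range(20000)]
    # Sampling: the test runs on a compact subset, not on the full data.
    sample_control = rng.sample(control, sample_size)
    sample_treated = rng.sample(treated, sample_size)
    return permutation_test(sample_treated, sample_control)


if __name__ == "__main__":
    difference, p_value = demo()
    print(f"observed difference={difference:.2f}, p-value={p_value:.4f}")
```

The point of the design is that the question ('is the treated group's mean different?') is fixed before the data are examined, and the p-value quantifies the uncertainty in the answer using a few hundred rows rather than the full dataset.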

When it comes to data, size isn’t everything, because big data on their own cannot solve the problem of ‘insight’ (i.e. inferring what is going on). The true enablers are the data scientists and statisticians who, for more than two centuries, have been obsessed with understanding the world through data and with the traps that lie in wait during this exercise. In the world of analytics (AaaS), it is agility (using science, investigative skills, appropriate technology), trust (to solve the client’s real business problems and build collateral), and ‘know-how’ (to extract intelligence hidden in the data) that are the prime ‘assets’ for competing, not the size of the data. Big data are certainly here, but big insights have yet to arrive.

About the author

Dimitrios Chalvatzis
I have a genuine passion for Statistics and have been developing data analytic and visualisation solutions for more than 18 years. My current client engagement is with a global FMCG company, delivering data science in the fascinating domain of unstructured social media data.
1 Comment
Informative post! Realistic assessment of BD.
