More data may mean more problems

Capgemini

6 Oct 2021

When it comes to keeping data, “Mo’ Data, Mo’ Problems”

Faced with the problem of dwindling storage space and retaining our cherished possessions, most of us would prefer to keep our precious objects “just in case”, rather than throw them out. The same is true for organisations and their data: it’s easier and simpler to buy more storage than to take the time and effort to review and spring clean their data ecosystem.

This analogy is limited: I can’t just ‘buy’ more loft space to continue with my hoarding habit (not easily anyway), but buying more capacity is a perfectly viable and, on the face of it, sensible option for data storage.

After all, the cost of storage continues to fall while the potential benefit (the value of data) continues to rise, so the case for retaining data is easy to make (even if the benefit is not easily quantified).

And then there are the intangibles: who has the time or expertise to work out which data to delete? You never know whether there are valuable nuggets in the mountain of data just waiting to be mined by the new (or next) generation of tools. Or better yet, there may be an organisation in the wider data ecosystem that would be willing to pay for your data, as described in the Capgemini Research Institute’s report “Data Sharing Masters: How smart organizations use data ecosystems to gain an unbeatable competitive edge”[1].

While there’s guidance available on how to assess the value of your data, for example Jennifer Belissent’s paper “Determine Your Data’s Worth: Data Plus Use Equals Value”[2], there’s less discussion about the costs and risks involved in hanging on to your data.

This article is the first of two parts presenting a few counter-considerations you might want to bear in mind when weighing up whether to retain or destroy your data. The focus in this part is on costs that arise from the nature of the data being retained, while part two presents costs incurred by the mechanisms used to persist, access, and generally manage the data (independent of the ‘payload’).

Costs from Data

The content of the data, rather than its processing, is where data value comes to the fore, but so do the risks and costs:

Bruce Schneier presents a well-reasoned, thought-provoking, and informative argument about the risks of retaining data in his article “Data is a Toxic Asset”[3] and lays out the case for the risk data poses when it falls into the wrong hands.
Reputational damage via the discovery of embarrassing content can be significant, as Coca-Cola discovered when the California-based public interest group “US Right to Know” succeeded with a Freedom of Information Act request that brought to light Coca-Cola’s attempts to influence the US Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO)[4].

The potential value of personalised data is clear: “According to the European Commission, by 2020 the value of personalised data will be one trillion euros, almost eight per cent of the EU’s GDP”[1]. But in addition to the risks of a data breach, it’s important to factor in the costs to anonymise, obfuscate and desensitise data. As work on Differential Privacy shows, it’s not easy, or perhaps even possible, to anonymise data sets fully, so releasing anonymised data will incur ongoing costs for its management and handling.
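
To give a flavour of why anonymised release carries an ongoing cost, here is a minimal sketch of one common Differential Privacy technique, the Laplace mechanism, applied to a simple count query. It is an illustration under assumptions rather than a production recipe: the function names, the `epsilon` value and the sample records are all hypothetical, and a real deployment would also need to track the cumulative privacy budget across every query released.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centred on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a count with Laplace noise added (hypothetical helper).

    A count query has sensitivity 1: adding or removing one person
    changes the true answer by at most 1, so the noise scale is 1/epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Made-up records purely for illustration.
    people = [{"age": age} for age in range(100)]
    released = noisy_count(people, lambda p: p["age"] >= 50, epsilon=0.5)
    print(f"Released (noisy) count of over-50s: {released:.1f}")
```

A smaller `epsilon` gives stronger privacy but noisier answers, and every additional query spends more of the budget. Someone has to own that trade-off for as long as the data is retained.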

To be clear, the potential cost of a data breach is daunting: HIPAA penalties can range from $100 to $50,000 USD per record, while fines for a GDPR breach can reach €20 million (about $24 million USD) or four per cent of annual global turnover, whichever is greater. Thirani and Gupta, in their article “The Value of Data”[5], reported in 2017 that Equifax faced a class action lawsuit of up to $70 billion USD resulting from a data breach involving the data of 143 million users.
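
As a rough illustration of the arithmetic, the sketch below turns the figures quoted above into a back-of-the-envelope exposure estimate. The function names are hypothetical, the default amounts are simply the numbers cited in this article (in USD), and the HIPAA calculation takes the per-record range at face value. It is a thinking aid, not legal or actuarial guidance.

```python
def gdpr_max_fine(annual_global_turnover, fixed_cap=24_000_000, turnover_rate=0.04):
    """Upper bound on a GDPR fine: the greater of a fixed cap or a share of turnover.

    Defaults are the article's figures (about $24 million USD or four per cent).
    """
    return max(fixed_cap, turnover_rate * annual_global_turnover)


def hipaa_penalty_range(records, low_per_record=100, high_per_record=50_000):
    """Lowest and highest penalty for a number of records, using the quoted range."""
    return records * low_per_record, records * high_per_record


if __name__ == "__main__":
    # Hypothetical organisation: $1 billion turnover, 10,000 records exposed.
    print(f"GDPR ceiling: ${gdpr_max_fine(1_000_000_000):,.0f}")
    low, high = hipaa_penalty_range(10_000)
    print(f"HIPAA range: ${low:,} to ${high:,}")
```

Even with generous assumptions, the upper bound grows with the number of records held, which is the point: data you no longer hold cannot be breached.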

Another impact associated with GDPR is the increased cost and complexity of processing a Subject Access Request (SAR) when greater volumes of data have to be searched, which is another reason to destroy unwanted data. This applies to standard searches across your data estate too: more data means longer search times to find the data that you are looking for.

Next time…

I’ll be looking at the Costs of Data (the costs of storing, managing, and stewarding the data that you’ve decided to keep) and rounding off with a game plan to keep your data under control.

References and Further Reading

[1] Zhiwei Jiang et al., “Data Sharing Masters: How smart organizations use data ecosystems to gain an unbeatable competitive edge” (Capgemini Research Institute, 2021) – https://www.capgemini.com/gb-en/research/data-sharing-masters/

[2] Belissent, Leganza and Vale, “Determine Your Data’s Worth: Data Plus Use Equals Value” (Forrester, January 4, 2021)

[3] Bruce Schneier, “Data is a Toxic Asset, so why not throw it out?” (CNN March 1, 2016) – https://edition.cnn.com/2016/03/01/opinions/data-is-a-toxic-asset-opinion-schneier/index.html

[4] Richard Belcher, “Emails show Coca-Cola exec’s attempt to influence CDC, WHO on sugary drinks” (WSB-TV Channel 2 Atlanta, May 22, 2019) – https://www.wsbtv.com/news/2-investigates/emails-show-coca-cola-execs-attempt-to-influence-cdc-who-on-sugary-drinks/950895463/

[5] Vasudha Thirani & Arvind Gupta, “The Value of Data” (World Economic Forum, September 22, 2017) – https://www.weforum.org/agenda/2017/09/the-value-of-data/