More data may mean more problems

Capgemini

6 Oct 2021

When it comes to keeping data, “Mo’ Data, Mo’ Problems”

Faced with dwindling storage space, most of us would rather keep our cherished possessions “just in case” than throw them out. The same is true for organisations and their data: it’s easier and simpler to buy more storage than to take the time and effort to review and spring clean their data ecosystem.

Last time…

In the first article of this two-part series I unpacked some of the considerations you might want to take into account when deciding whether to retain data. The focus was on the costs that are linked to the content of the data, including costs that might arise from not complying with regulations such as GDPR.

In this concluding article I cover aspects of the costs and impact of the machinery and processes (infrastructure and management) needed to support your data that are independent of the contents.

Costs of Data

The “cost of data” includes those costs that arise from treating data as a commodity or black box, and includes the costs of capturing, processing, transferring, and storing the data.

At first glance the cost of storing data is simply the $/MB for storage, so the price for another X TB of data is a straightforward calculation. Simple. But when we factor in other considerations such as those presented below, then the picture around the costs of data as a commodity becomes more complex.

Keeping the lights on

Information systems are designed around the needs of the business, and to enable business continuity we enhance system resilience (including data availability) using techniques such as redundancy, which makes the system tolerant to the failure of one or more components.

Approaches to redundancy at the data layer include database replication (in case the primary database fails) or complete duplication of a compute node (hosts and data). Most on-premises organisations have multiple data centres, but for Cloud-based organisations this might include hosting data and compute in multiple regions of the world, in case of a catastrophic failure in one geographic region. The key point is this: the approach taken to redundancy may involve multiple copies of the data across the organisation’s data landscape. This drives not only the cost of storage capacity, but also the cost of the network traffic needed to communicate between the data stores and keep them in sync.

Similarly, around disaster recovery, there may be multiple generations of backup media that provide snapshots of data to enable a failed service to be recovered, so each additional megabyte of data being preserved requires multiple megabytes of backup media storage. The same principle applies if the data is simply archived from the system into offline storage.
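To make the multiplier effect concrete, here is a minimal sketch of the arithmetic. The function names, the replica and backup counts, and the single flat price per TB are all hypothetical assumptions chosen for illustration; real pricing usually differs by storage tier and excludes network charges, which this sketch ignores.

```python
def effective_storage_tb(primary_tb, replicas, backup_generations, backup_ratio=1.0):
    """Estimate the total TB held for `primary_tb` of live data.

    replicas           -- additional synchronised copies (assumed full copies)
    backup_generations -- number of backup snapshots retained
    backup_ratio       -- size of each backup relative to the primary
                          (1.0 = full copy; lower if compressed or incremental)
    """
    live_copies = primary_tb * (1 + replicas)
    backups = primary_tb * backup_generations * backup_ratio
    return live_copies + backups


def monthly_storage_cost(primary_tb, price_per_tb, **redundancy):
    """Apply one flat, hypothetical price per TB to every copy."""
    return effective_storage_tb(primary_tb, **redundancy) * price_per_tb


# Illustrative figures only: 10 TB of data, 2 replicas, 4 half-size backups.
total_tb = effective_storage_tb(10, replicas=2, backup_generations=4, backup_ratio=0.5)
cost = monthly_storage_cost(10, 20.0, replicas=2, backup_generations=4, backup_ratio=0.5)
```

Under these assumed figures, 10 TB of retained data occupies 50 TB of capacity, so the simple $/TB calculation understates the storage bill by a factor of five.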

Managing replicas

While having multiple instances of your data improves data availability and system resilience, multiple copies bring challenges too. In addition to increased overall complexity, detecting discrepancies in distributed data can be difficult, and defining and maintaining a master data management strategy becomes harder.
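One common way to detect discrepancies between copies is to compare a digest of each copy rather than the data itself. The sketch below is illustrative only: it assumes each replica can be read as a list of dictionaries small enough to hold in memory, and the function names and the majority-wins rule are assumptions, not a description of any particular product.

```python
import hashlib
import json
from collections import Counter


def dataset_digest(rows):
    """Return an order-independent SHA-256 digest of a list of dict rows."""
    # Serialise each row with sorted keys, then sort the rows themselves,
    # so that two replicas holding the same data produce the same digest.
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def find_divergent(replicas):
    """Return the names of replicas whose digest differs from the most common one."""
    digests = {name: dataset_digest(rows) for name, rows in replicas.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority_digest)


# Hypothetical replicas: the third copy has drifted out of sync.
replicas = {
    "region-a": [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "York"}],
    "region-b": [{"id": 2, "city": "York"}, {"id": 1, "city": "Leeds"}],
    "region-c": [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "Hull"}],
}
divergent = find_divergent(replicas)
```

Note that a digest only says that copies differ, not which one is right; deciding that is the job of the master data management strategy.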

Updating multiple distributed data sources consistently requires more complex, and so more expensive, architectures and additional costly network traffic.

As well as the financial cost, storing more data has an environmental impact: from a sustainability viewpoint, it drives up the use of resources such as electricity, and it increases the cost of running heating, ventilation, and air conditioning (HVAC).

Securing your data

To round off the discussion of the costs of data: the more data you have, the more interested hackers and other actors will be in targeting it, especially if it’s personal or sensitive. This drives up the complexity, sophistication, and cost of cybersecurity protection (network infrastructure, protective monitoring and so on).

3 Things to think about…

Is the data “good enough” to keep?

Explore the data to get a sense of what it contains (examples include personal data, sensitive data and so on)
Assess the quality of the data (will you be able to realise the value?)
Ensure that the information has been collected for the right ‘purpose’ under GDPR to enable you to execute on potential use cases

Does the data have value?

Identify use cases for which the data could be used to deliver business value for your organisation
Identify use cases that the data could drive in your wider data ecosystem – determine which of your partner organisations might be willing to buy your data or trade for sharing theirs

What’s the cost of keeping the data?

Be aware of all the real costs of storing, transferring, governing, and protecting the data, and estimate what these costs are for the data under consideration
Calculate the risks of retaining the information if it should fall into the hands of someone who could misuse it

Having answers to these questions will give you a more informed view of whether the value of retaining the data outweighs the costs and risks.

…And if you do decide to keep the data, do:

Improve the quality of the data as far as is needed and practically achievable. For example, normalise data before storage so that you only normalise once rather than every time the data is accessed
Anonymise or mask personally identifiable information (PII) and sensitive data to reduce the impact if there is a data breach or inadvertent release of the data
Catalogue the data, and set a retention policy, so that you have a record of the data you hold and when it needs to be reviewed.
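
The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended implementation: the field names, the secret key handling, and the `CatalogueEntry` class are hypothetical. Note also that a keyed hash pseudonymises rather than anonymises, so under GDPR the result is still personal data for anyone who holds the key.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import date, timedelta


def normalise_email(raw):
    """Normalise once, at write time, so readers never have to."""
    return raw.strip().lower()


def pseudonymise(value, secret_key):
    """Replace a PII value with a keyed SHA-256 token (HMAC).

    The same input and key always give the same token, so records can
    still be joined, but the original value is not stored.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue record for one stored data set."""
    name: str
    contains_pii: bool
    stored_on: date
    retention_days: int

    @property
    def review_due(self):
        return self.stored_on + timedelta(days=self.retention_days)

    def needs_review(self, today):
        return today >= self.review_due


# Illustrative usage; in practice the key would come from a secrets store.
key = b"example-key-not-for-production"
email = normalise_email("  Alice@Example.COM ")
token = pseudonymise(email, key)
entry = CatalogueEntry("customers", contains_pii=True,
                       stored_on=date(2021, 10, 6), retention_days=365)
```

Calling `entry.needs_review(date.today())` on a schedule gives a simple prompt to revisit whether the data set is still worth keeping.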