“Have nothing in your houses that you do not know to be useful or believe to be beautiful.” (William Morris)
Data growth is becoming a major topic, not only for internet firms and cloud providers, but also for your organization. Databases keep growing because more information is gathered and stored. This growth of data seems to be a phenomenon that cannot be controlled. We have to live with it and hope that developments in hardware and software will keep up with the ever-growing data volumes.
If you regard data growth as a storage issue, and adding storage looks like the solution, you might have considered multi-tiering and data archiving as a cost-effective way of adding storage. By moving data to storage devices with lower operational costs, you save money along the way.
And for some people, even suppliers, that’s the complete story. Archive your data on cheap media and keep it forever. Why? Because storage is cheap and deleting or purging data is complicated. Because I don’t know when I might need my old data again. To purge, I need to know the retention schemes, the actual use of the data, possible legal holds, the business value of the data and a whole bunch of other things. Moreover, I need to know my database schemas and how to delete records without breaking the integrity of the database. Is it worth the effort to discover and implement retention policies? Well, that’s the question. Let me give an incomplete list of reasons why deleting archived data is useful, even obligatory.
Mounting costs of archive storage
The larger the data archive grows, the more money has to be spent on keeping the archive available. Archives certainly have lower SLAs to meet, but they still contain data with business value that has to be available. Data archives take up rack space, consume power and require cooling.
There is cost overhead associated with the amount of data that we store. This includes the cost of the storage medium, the infrastructure and the human resources necessary to manage the data. Keep in mind that the data might be replicated in backup systems, test systems and so forth. I know organizations that replicate data ten times, multiplying storage costs by ten as well. Multi-tiering helps to reduce costs. But compared to deleted data, it’s still a lot of money.
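To make the multiplication concrete, here is a minimal sketch in Python. The archive size, the price per terabyte, the number of copies and the purge fraction are all made-up assumptions for illustration, not figures from a real organization.

```python
def yearly_storage_cost(size_tb: float, price_per_tb_year: float, copies: int) -> float:
    """Yearly cost of keeping `copies` full copies of an archive of `size_tb` terabytes."""
    return size_tb * price_per_tb_year * copies


# Hypothetical figures, chosen only to illustrate the arithmetic.
ARCHIVE_TB = 50            # assumed size of the archive
CHEAP_TIER_PRICE = 100.0   # assumed price per TB per year on the low-cost tier
COPIES = 10                # the archive plus backups, test systems and so forth
PURGE_FRACTION = 0.6       # assumed share of the archive with no remaining value

keep_everything = yearly_storage_cost(ARCHIVE_TB, CHEAP_TIER_PRICE, COPIES)
after_purge = yearly_storage_cost(ARCHIVE_TB * (1 - PURGE_FRACTION), CHEAP_TIER_PRICE, COPIES)

print(f"keep everything: {keep_everything:.0f} per year")
print(f"after purging:   {after_purge:.0f} per year")
```

Whatever numbers you plug in, the saving from purging is multiplied by the number of copies, just like the cost is.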
Decreasing performance and reliability
The amount of data stored in the data archive affects the performance of the archive itself. More data means more time-consuming discovery, reporting and maintenance operations. The business side may not be interested in the very old facts and figures. Data with no business value consumes space. And it can lead to errors in reports. We might assume we know what old data means, but are you still using the same customer number scheme as ten years ago? Have you recycled article numbers? Will we calculate our KPIs the same way forever? Data not only loses value because we don’t use it anymore, it also loses value because we cannot use it anymore.
Retaining such irrelevant data may not be required and may even cause erroneous reporting. It will not only cost money, but can also pollute the insights you get from your data.
Still open to security risks
Like every piece of data, archived data is exposed to security risks like hacking, unlawful disclosure, data theft and so forth. Archived data may be old, but it can still contain classified information: not only your own company secrets, but also confidential information about or from your outside relations. You have the obligation to keep not only your own data confidential, but also the data of others that you keep in your systems. The best way to avoid security risks is to delete the data you no longer need. What isn’t there cannot be stolen.
Privacy laws, like the European General Data Protection Regulation (GDPR), impose restrictions on the data you can keep in your archive. Masking data like bank account numbers and credit card information might help to comply with the rules. But the law goes further: any information that can be traced back to a person, like names and addresses, must be masked at some point. How usable is your data after stringent masking? Is it worthwhile keeping it? Privacy by Design and by Default should also be incorporated in your archive. Maybe purging Personally Identifiable Information (PII) is just the easiest way to comply.
In the United States the situation is even more challenging. Defensible Disposal is needed to delete data that has no business value but might be used against you in a court of law. In order to mitigate legal risks, retention and disposal of data is obligatory, even for archived data.
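What does such a retention and disposal rule look like in practice? Below is a minimal sketch in Python of the check you need before purging a record: has the retention period passed, and is there no legal hold? The record layout, the categories and the retention periods are hypothetical. Your own retention schedule has to come from your legal and business people, not from this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedRecord:
    record_id: str
    category: str            # hypothetical categories, e.g. "invoice"
    closed_on: date          # the date the retention clock starts
    legal_hold: bool = False


# Hypothetical retention schedule: years to keep a record per category.
RETENTION_YEARS = {"invoice": 7, "support_ticket": 2}


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February in a non-leap target year: move to 1 March.
        return d.replace(year=d.year + years, month=3, day=1)


def may_be_purged(record: ArchivedRecord, today: date) -> bool:
    """True only if the retention period has passed and no legal hold applies."""
    if record.legal_hold:
        return False
    years = RETENTION_YEARS.get(record.category)
    if years is None:
        # Unknown category: keep the record until someone has decided.
        return False
    return today >= add_years(record.closed_on, years)
```

Note the two defaults that play it safe: a legal hold always wins, and a record without a known retention period is never purged automatically.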
The need for preservation?
If you’ve decided that you want to keep some data forever, there is one last decision to be made. How can I keep my data sustainable and accessible over time? Can I read and interpret the data in ten or twenty years? Just storing a database in raw format isn’t a good idea. In a few decades those formats may no longer be readable, so in effect the data becomes useless. And if you’re able to read the data, do you still know what the data means, the semantics? In order to keep data accessible over long periods of time, you have to take precautions, like converting the data to a canonical format, documenting the semantics and structure of the data, refreshing the media on which the data is stored and so forth. NASA has started projects to preserve the data from their space missions. Are you willing to do the same?
Keeping data in an archive seems to be an easy solution. Deploying retention policies can be complex and tedious. But on the other hand, keeping data without any risk mitigation will lead to mounting costs and increasing risks. You’ll end up with a pile of useless data you have to keep, with only costs and no benefits. Data archiving seems so simple, but it isn’t. Whatever you keep, you’ll have to maintain. Whatever you discard frees up your memory.
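To give an idea of the first two precautions, here is a minimal sketch in Python. It assumes the source is an SQLite database and takes CSV plus a small JSON data dictionary as the "canonical format"; both choices are assumptions for the sake of the example, and the function name export_table and the file naming are hypothetical.

```python
import csv
import json
import sqlite3
from pathlib import Path


def export_table(conn: sqlite3.Connection, table: str, out_dir: Path,
                 descriptions: dict[str, str]) -> None:
    """Write one table as CSV, plus a JSON file describing what each column means."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    columns = [col[0] for col in cursor.description]

    # The data itself, in a plain text format with a header row.
    with open(out_dir / f"{table}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(cursor)

    # The semantics: columns without a description are flagged, not silently skipped.
    dictionary = {
        "table": table,
        "columns": [
            {"name": name, "meaning": descriptions.get(name, "UNDOCUMENTED")}
            for name in columns
        ],
    }
    (out_dir / f"{table}.schema.json").write_text(
        json.dumps(dictionary, indent=2), encoding="utf-8"
    )
```

You would call it once per table, for example export_table(conn, "customers", Path("archive_export"), {"id": "customer number"}). The hard part is not the export but filling in the descriptions while somebody still remembers what the columns mean.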