I was struck the other day by a client use case – no names, no pack drill. Suffice to say it’s a company that sells millions and millions of items a year, and has regulators that dive deep into key stats, but only once a year.
Rolling product analytics are done locally, in region. Finance and the like are dealt with by the Enterprise Data Warehouse at quite an abstracted layer. But dealing with the annual reporting to key regulators is a challenge at this scale. Keeping all transactional data globally, with an audit trail to source (even for, say, just the US), is an expensive task.

Let’s take a record size of 1,250 bytes for full transactional data with the traceability to source, etc.

1,250 bytes × 1bn transactions = 1.25 × 10^12 bytes, or about 1.14TB per year (counting 1TB as 2^40 bytes)
At that rate, the last 10 years of data comes to roughly 11.4TB and 20 years to roughly 22.8TB – all for transactional data we dare not throw away due to the risk of long-term class actions and regulator challenges…
…It’s likely I need it only once a year, and will probably have to keep it for up to 100 years.
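
For anyone who wants to check the per-year sizing, here is a minimal sketch of the arithmetic. The record size and transaction count come from the figures above; the flat yearly rate and the function names are my own assumptions for illustration.

```python
RECORD_BYTES = 1_250            # record size quoted in the text
TRANSACTIONS_PER_YEAR = 10**9   # "1bn" transactions a year


def raw_volume_bytes(years: int = 1) -> int:
    """Raw bytes for `years` of records, assuming a flat yearly rate."""
    return RECORD_BYTES * TRANSACTIONS_PER_YEAR * years


def to_decimal_tb(n_bytes: int) -> float:
    """Terabytes counted as 10^12 bytes."""
    return n_bytes / 10**12


def to_binary_tib(n_bytes: int) -> float:
    """Tebibytes counted as 2^40 bytes."""
    return n_bytes / 2**40


one_year = raw_volume_bytes()
print(f"{to_decimal_tb(one_year):.2f} TB (decimal)")   # 1.25 TB (decimal)
print(f"{to_binary_tib(one_year):.2f} TiB (binary)")   # 1.14 TiB (binary)
```

The two prints show why the yearly figure can be quoted as either 1.25 or 1.14 depending on which definition of a terabyte is in play. This is raw record volume only; it says nothing about indexes, replicas or backups.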

What are the options? Store to tape? Keep it in live storage? Both are costly (heat, light, power, etc.), and even the future of tape is getting questionable. Slow, big flash storage is a few years away but will come. There is another way that we see: cold store with a temporary business data lake capability.
“Cold store” the data to Amazon Glacier, or take the approach offered by Facebook’s open-source work: a Blu-ray archive capable of multi-petabyte stores. It doesn’t really matter which; all offer long-term certainty. Recovery time is largely irrelevant, provided it is minutes and not hours. Cold store is cheap, an easy win when meeting the regulator’s requirements is a pure cost overhead.
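
To make the Glacier option a little more concrete, here is a rough sketch of one way to do it: writing objects to Amazon S3 with the `GLACIER` storage class and later asking for a temporary restore. This is an assumption on my part, not the client’s setup. The functions expect a boto3-style S3 client to be passed in; the bucket name, key and the `RecordingClient` stand-in are hypothetical and exist only so the example runs without an AWS account.

```python
class RecordingClient:
    """Hypothetical stand-in for an S3 client: records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(("put_object", kwargs))

    def restore_object(self, **kwargs):
        self.calls.append(("restore_object", kwargs))


def archive_to_cold_store(s3_client, bucket: str, key: str, payload: bytes) -> None:
    """Write one object straight into the GLACIER storage class."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload, StorageClass="GLACIER")


def request_restore(s3_client, bucket: str, key: str,
                    days: int = 7, tier: str = "Expedited") -> None:
    """Ask for a temporary readable copy, kept for `days` days.

    `tier` is one of "Expedited", "Standard" or "Bulk".
    """
    s3_client.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": tier}},
    )


client = RecordingClient()
# Bucket and key below are made-up names for illustration.
archive_to_cold_store(client, "regulator-archive", "us/transactions.parquet", b"...")
request_restore(client, "regulator-archive", "us/transactions.parquet")
print([name for name, _ in client.calls])  # ['put_object', 'restore_object']
```

With real credentials the stand-in would be swapped for `boto3.client("s3")`. The retrieval tier is where the “minutes and not hours” question gets decided, so it is worth checking AWS’s current retrieval times and limits for each tier before relying on one.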
Once a year, as the regulator demands:

  • Extract either the whole global data set or a single region, as needed, into Hadoop, along with the previous years where required.
  • Distill the data sets using large-scale data set tools like Pivotal’s PDD.
  • Provide the regulator with the line-of-business view needed.
  • Complete, shut down, and archive the new data sets to cold store.
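
The yearly cycle above can be sketched in a few lines. This is a toy model under loud assumptions: an in-memory dictionary stands in for the cold store, a plain Python list stands in for the temporary Hadoop lake, and a simple per-line total stands in for the distillation tooling. All class, function and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColdStore:
    """In-memory stand-in for the archive (Glacier, Blu-ray, ...)."""
    archives: dict = field(default_factory=dict)  # (region, year) -> list of records

    def restore(self, region, years):
        """Return all archived records for a region across the given years."""
        return [rec for y in years for rec in self.archives.get((region, y), [])]

    def archive(self, region, year, records):
        """Store one region-year of records."""
        self.archives[(region, year)] = list(records)


def distill(records):
    """Reduce raw transactions to a line-of-business view: totals per line."""
    view = {}
    for rec in records:
        view[rec["line"]] = view.get(rec["line"], 0) + rec["amount"]
    return view


def annual_regulator_run(store, region, new_year, new_records, history_years):
    # 1. Pull the needed history out of cold store into the temporary lake.
    lake = store.restore(region, history_years) + list(new_records)
    # 2-3. Distill the data and produce the view the regulator asked for.
    report = distill(lake)
    # 4. Archive the new year's data, then shut the temporary lake down.
    store.archive(region, new_year, new_records)
    lake.clear()
    return report


store = ColdStore()
store.archive("US", 2013, [{"line": "retail", "amount": 100}])
report = annual_regulator_run(
    store, "US", 2014,
    [{"line": "retail", "amount": 50}, {"line": "wholesale", "amount": 20}],
    history_years=[2013],
)
print(report)  # {'retail': 150, 'wholesale': 20}
```

The point of the shape is that the lake exists only for the duration of the run: history comes out of cold store, the report is produced, and the only thing left behind is one more archived year.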

This extends what Capgemini thinks of as the wider architecture for the Business Data Lake. For most use cases very high latency storage would be a problem; this approach gives an archival method for your legacy, low-usage data. It’s a choice driven by business demand to match a cost requirement, not a sign that the Data Lake is inherently high latency.

It meets the demand case for when you need insight just once a year.