B is for “But I only need analytical insight once a year…”

I was struck the other day by a client use case – no names, no pack drill. Suffice to say it's a company that sells millions and millions of items a year, and has regulators that dive deep into the key stats, but only once a year.
 
Rolling product analytics are done locally, in region. Finance and the like is dealt with by the Enterprise Data Warehouse at quite an abstracted layer. But how to handle the annual reporting to key regulators is a challenge at this scale: keeping all transactional data globally, with an audit trail back to source (even for, say, just the US), is an expensive task.

Let's take a record size of 1,250 bytes for the full transactional record with traceability back to source.

1,250 bytes × 1BN transactions ≈ 1.14 TB per year
 
For the last 10 years of data, once replication, audit overhead and growth across regions are factored in, we are soon at around 45 TB; for 20 years, over 90 TB – all for transactional data we dare not throw away due to the risk of long-term class actions and regulator challenges…
 
…It's likely I need it only once a year, yet will probably have to keep it for up to 100 years.
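To put rough numbers on that, here is a back-of-the-envelope sketch in Python, assuming the 1,250-byte record and a flat 1BN transactions a year (no growth, no replication, no compression):

```python
# Back-of-the-envelope retention sizing, assuming a flat 1BN transactions/year
# at 1,250 bytes per fully traceable record.
RECORD_BYTES = 1_250
TRANSACTIONS_PER_YEAR = 1_000_000_000

def retained_tib(years: int) -> float:
    """Raw size of `years` of history, in TiB."""
    total_bytes = RECORD_BYTES * TRANSACTIONS_PER_YEAR * years
    return total_bytes / 1024**4

for years in (1, 10, 20, 100):
    print(f"{years:>3} years ≈ {retained_tib(years):6.1f} TiB")

# Roughly 1.1 TiB per year, so about 114 TiB for a century of history,
# before growth, replication and audit overhead are factored in.
```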

What are the options? Store to tape? Keep it in live storage? Both are costly – heat, light, power and so on – and even the future of tape is getting questionable. Slow, big flash storage is a few years away, but it will come. There is another way that we see: cold store with a temporary Business Data Lake capability.
 
"Cold store" the data to Amazon Glacier, or take the approach offered by Facebook's open-source Blu-ray archive, capable of multi-petabyte stores – it doesn't really matter which; both offer long-term certainty. Recovery time barely matters for a once-a-year job: hours rather than minutes is perfectly acceptable. Cold store is cheap, an easy win when meeting the regulator's requirements is a pure cost overhead.
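To make the cold-store side concrete, here is a minimal sketch using the AWS SDK for Python (boto3). The bucket name, key and retrieval window are hypothetical, and a Blu-ray archive or another provider would slot in just as easily:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; swap in whatever naming your archive policy uses.
BUCKET = "regulatory-archive"
KEY = "transactions/2014/us-full-export.parquet"

# Push the year's consolidated export straight into a cold storage class.
with open("us-full-export.parquet", "rb") as export:
    s3.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=export,
        StorageClass="GLACIER",  # or "DEEP_ARCHIVE" for colder, cheaper storage
    )

# Once a year, request a temporary thawed copy so the data lake tooling can read it.
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={
        "Days": 14,  # how long the thawed copy stays available
        "GlacierJobParameters": {"Tier": "Standard"},  # retrieval in hours, fine for an annual job
    },
)
```

The Days parameter only keeps a temporary thawed copy available; the archived original stays in cold storage throughout.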
 
Once a year, as the regulator demands:

  • Pull either the whole global data set, or just the region in question, back into Hadoop, along with previous years where required.
  • Distill the data sets using large-scale tooling such as Pivotal's PDD (sketched after this list).
  • Provide the regulator with the line-of-business view they need.
  • Once complete, shut down and archive the new data sets back to cold store.
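As an illustration of the distill step, here is a sketch using PySpark rather than Pivotal's tooling specifically; the paths, schema and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative names throughout; the real schema comes from the restored archive.
spark = SparkSession.builder.appName("annual-regulatory-distill").getOrCreate()

# 1. Load the thawed archive (plus prior years if the regulator asks for trends).
txns = spark.read.parquet("s3a://regulatory-archive/transactions/2014/")

# 2. Distill to the line-of-business view the regulator actually wants.
lob_view = (
    txns
    .filter(F.col("region") == "US")
    .groupBy("line_of_business", "product_code")
    .agg(
        F.count("*").alias("transaction_count"),
        F.sum("net_value").alias("total_net_value"),
    )
)

# 3. Hand over the distilled view, keeping the keys needed to trace back to source.
lob_view.write.mode("overwrite").parquet(
    "s3a://regulatory-archive/reports/2014/us-lob-view/"
)

spark.stop()
```

The same job can be rerun against any prior year simply by pointing it at the relevant thawed archive path.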

This extends what Capgemini sees as the wider architecture for the Business Data Lake. For most use cases very high-latency storage would be a problem; this approach simply adds an archival tier for your legacy, low-usage data. It's a choice driven by business demand to match the cost requirement, not a sign that the Data Lake is inherently high latency.

It meets the demand case for when you need insight just once a year.

