For 30 years in IT we’ve had an approach, the Enterprise Data Warehouse, and it’s served us pretty well. By ‘us’ I mean that it’s helped IT to manage the costs around reporting. Two things have driven those costs:
- Data Storage
- Data Movement
Warehouse is a great metaphor for how this has worked. What does it take to run a really great warehouse, not just for data but a great physical warehouse for a retailer or logistics company?
Well, first you need the infrastructure, the shelves and floor space, and this has physical limitations. The shelves can store boxes and pallets of certain sizes, and the amount that you can store is limited by some very real cost constraints. In theory you could build a warehouse that can store everything you will ever need, but in reality the cost of land, the storage space, means you need to limit the volume that can be carried. Next up, you need to know where things need to go. It’s no good just taking things off a truck and throwing them in; to be efficient you need to plan where things go and how you will pull them out.
To get that plan, you need to know what you need to store. That way you can map out the warehouse and assign different products to different places, and you can optimize the plan based on which things get ordered most regularly and which things get ordered together. In this way you get a great layout that is optimized for exactly the boxes and pallets you are expecting, so when boxes come in you can be as efficient as possible getting them in and getting them out.
For a physical warehouse, a store or a data warehouse the constraints are simple: it costs to store and it costs to move.
But what if your business isn’t just about selling things in your own warehouse? What if you are trying to sell not just your own stock but act as a channel for multiple organizations, the sort of thing that eBay or Amazon do? What if you have to start handling new in-demand products at a moment’s notice? Suddenly the rigid rules of storage and movement have changed. If it’s not your storage space and it’s not your movement costs, then does the same level of rigor and planning apply? The answer is of course not. It would be madness to try to enforce a single rigid approach on everyone you deal with: you don’t know how they work and you don’t know what their layout is.
Warehouses work when you can define everything; the problem with data is that you can’t. The ‘Customer’ for the Sales team isn’t the same as the ‘Customer’ for the Finance team, so you create an artificial definition of ‘prospect’ to handle the discrepancy. One division sells nuts and bolts, another sells particle accelerators. Their product definitions, orders, invoices and finance terms are completely different, and getting them to agree isn’t simply difficult, it’s impossible. As new data becomes available the business needs to incorporate it straight away. It can’t wait months for an extension to be built; it needs the insight right now.
The point here is that in the fluid world of business the Warehousing approach to data no longer works. We need an approach where businesses can siphon or distill off the information they need when they need it. It’s this philosophy shift that underpins what we’ve done at Capgemini with Pivotal around the Business Data Lake, changing the game to something that better meets the needs of a business.
First of all there is the Lake. This isn’t a rigid definition; it’s just a place where all of the data in the enterprise can be dropped, so every system, every partner and every external source flows into a single repository ready for use. There is no restriction on schema or plan at this stage: it is data in its raw and native form.
The next stage is distillation: taking just what you want from the lake and refining it for the purpose that you want it for. This is the personal perspective, not a single centrally planned view. Crucially, in the Business Data Lake the raw data remains in the lake, so multiple perspectives are possible on the same source. Not everyone has to agree on what they want; everyone can get their own perspective on the lake in a manner that enables their local business success.
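To make the idea of distillation concrete, here is a minimal sketch in Python. It is not how the Business Data Lake is implemented; the record layout, the field names and the two team definitions of ‘Customer’ are all invented for illustration. The only point it makes is that the raw data stays as it landed and each team derives its own view from it.

```python
# Raw records as they landed in the lake: no shared schema is imposed.
# All sources, field names and values here are hypothetical.
RAW_LAKE = [
    {"source": "crm", "name": "Acme Ltd", "stage": "lead"},
    {"source": "crm", "name": "Globex", "stage": "signed"},
    {"source": "billing", "account": "Globex", "invoiced": 1200.0},
    {"source": "billing", "account": "Initech", "invoiced": 300.0},
]


def sales_customers(lake):
    """Sales perspective (assumed): anyone in the CRM counts as a customer."""
    return sorted(r["name"] for r in lake if r["source"] == "crm")


def finance_customers(lake):
    """Finance perspective (assumed): only accounts that have been invoiced."""
    return sorted(
        r["account"]
        for r in lake
        if r["source"] == "billing" and r["invoiced"] > 0
    )


# Both perspectives read the same raw data and neither modifies it.
sales_view = sales_customers(RAW_LAKE)
finance_view = finance_customers(RAW_LAKE)
```

The two views disagree about who is a customer, and that is fine: neither team had to change its definition, and a third team could add its own view later without touching the raw records.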
This flexibility, however, would be nothing if it did not allow corporate consistency; this is where targeted information governance comes in. Master Data Management (MDM), Reference Data Management (RDM) and controls, the sort of controls that already ensure organizations report margin and finance elements consistently, are used to deliver corporate collaboration and visibility only where it is required.
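As a rough illustration of what ‘targeted’ governance could look like, the Python sketch below uses a small master-data cross-reference to line up customers across two local views, and leaves everything else ungoverned. The table, the identifiers and the function names are hypothetical; real MDM tooling is far richer than a dictionary lookup.

```python
# Hypothetical master-data cross-reference: (source, local key) -> master ID.
# Only the records that need corporate consistency appear here.
MASTER_CUSTOMER_IDS = {
    ("crm", "Globex"): "CUST-001",
    ("billing", "Globex Corp."): "CUST-001",
    ("billing", "Initech"): "CUST-002",
}


def master_id(source, local_key):
    """Return the governed master ID, or None if the record is ungoverned."""
    return MASTER_CUSTOMER_IDS.get((source, local_key))


def shared_customers(source_a, keys_a, source_b, keys_b):
    """Master IDs that two local perspectives have in common."""
    ids_a = {master_id(source_a, k) for k in keys_a}
    ids_b = {master_id(source_b, k) for k in keys_b}
    # Drop None so ungoverned records never count as a match.
    return sorted((ids_a & ids_b) - {None})


common = shared_customers(
    "crm", ["Acme Ltd", "Globex"],
    "billing", ["Globex Corp.", "Initech"],
)
```

Each team keeps its own names and its own definitions; the cross-reference is applied only at the point where the two views have to be compared.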
The Warehouse was a metaphor that worked in a world where IT faced cost constraints around storage and movement and needed to apply rigor and consistency. In the world of Big Data, without those restrictions, a more dynamic and fluid approach is required: an approach that helps the business adapt and change as its circumstances change, and which does not try to impose a single centrally planned view in a dynamic market economy.
The Business Data Lake is data for the new economy.