You have to manage your Data Lake – the fallacy of technology being magic

Gartner have published a report calling Data Lakes a fallacy, in which they point out many of the issues with an unmanaged Hadoop environment. It’s a great headline, but the paper itself actually raises exactly the points that we made at Capgemini back in 2011 about what companies should be doing in this space. Back then we published a paper on Mastering Big Data, which talked about how data governance was a core requirement for getting value out of Big Data.
This thinking was built upon in 2012, when we talked about the need to consider in-memory as a core requirement of a modern information landscape, and in 2013, when we talked about how Hadoop works collaboratively with traditional EDW and LDW approaches. Last year we spent a significant amount of effort creating the ‘Business Data Lake’, which puts exactly these governance and data management practices at the heart of the approach.
Gartner raise a very valid point: basically, Hadoop isn’t magic. Just dumping data into a single repository doesn’t mean that it’s now magically easy to use. But dismissing data lakes altogether is “throwing the baby out with the bathwater.” Where I think they’ve missed the opportunity is that they don’t talk about how things move forwards. Logical Data Warehouses (LDW, described by Gartner in the Magic Quadrant for Data Warehouse Database Management Systems) are not the answer either. They have been the norm in most decent EDW programs for many years, but they still don’t give the agility, flexibility and access to “all data” that a data lake approach can give. What is needed is a hybrid approach that allows the business to blend different approaches based on their requirements – the “Business” Data Lake.
It’s a fallacy to believe that a Data Lake will solve all your problems; it’s also a fallacy to believe that an EDW with an associated LDW will solve them all. The reality is that information and insight consumption today is a key driver of business success. It’s also the reality that new analytics approaches based on Big Data have been shown to outperform highly tuned approaches on smaller data sets. This means that any successful information strategy has to enable multiple approaches based on the business demand.
At the heart of the Business Data Lake is a simple philosophy: making a business better is about improving at the local level of execution. Having a corporate report that identifies a problem that has occurred is not as good as local analytics that fixes it before it has a chance to happen. What a managed Hadoop Data Lake provides is the ability to support multiple different types of consumption, depending on the business challenge being addressed.
A corporate warehouse view is still part of the Business Data Lake, but in addition it provides an approach to address Operational Data Stores, in-memory analytics and a multitude of other consumption models in a consistent way. It’s this need for consistency that Gartner hints at when it makes the recommendation:

Focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake.
For me, however, this isn’t a question of ‘instead’; you should be doing both. Information has to come from source systems, and in a traditional model it goes through an ETL process that transforms the data into a new form for that upstream application. For me this is exactly where Hadoop adds a significant benefit, and why traditional vendors such as Teradata are recommending this approach. The real challenge is how you collaborate between the EDW/LDW and the high-volume Hadoop world.

This is why we chose to work together with Pivotal when fleshing out how to technically realize the Business Data Lake. Pivotal have the traditional EDW & LDW technologies with Greenplum, they have the in-memory capabilities of Gemfire, Gemfire XD and SQLFire, and of course the Pivotal HD and HAWQ technologies that provide both Hadoop and ANSI-SQL on Hadoop. We came together to work on what it will actually take to operationalize this new unified approach to data, an approach that doesn’t consider Hadoop Data Lake v EDW to be either/or but is instead about how we make them work together and how we govern the environment in a way that matches the business.
Gartner are right to talk about the challenges of unmanaged Hadoop and of treating everything as unstructured, but this is not my experience of how people are looking to use Data Lakes. What people are looking for is a managed data substrate which feeds upstream applications, provides a more managed approach than traditional ETL and, most critically, uses that management to meet EDW, LDW, Data Science, in-memory and a multitude of other business information consumption requirements.
This is all about what the business wants to see and about ensuring that information delivers the value that it should. This is where we introduced the term ‘distillation’ (a creation of the Gemfire team at Pivotal, BTW) to reflect this new business-centric approach. It is through distillation that governance is applied, and that metadata and MDM are used to create the specific view for the specific business purpose. The distilled view could be as large as a corporate EDW, an LDW subset or as specific as an Excel spreadsheet. The point is that a Business Data Lake provides you with managed access to all the information and concentrates on all business information requirements, not simply those that fit within an EDW/LDW model.
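To make the idea of distillation a little more concrete, here is a minimal sketch in Python. It is an illustration only, not how Pivotal or the Business Data Lake actually implement it: the `RAW_LAKE` records, the `DistillationSpec` class and the `distill` function are all hypothetical names invented for this example. The assumption is simply that a view is defined by governance metadata (which fields an audience may see) plus an MDM-style master key used to consolidate records.

```python
from dataclasses import dataclass

# Hypothetical raw records as they might land in the lake, before any governance.
RAW_LAKE = [
    {"cust_id": "C1", "region": "EMEA", "email": "a@example.com", "spend": 120.0},
    {"cust_id": "C2", "region": "APAC", "email": "b@example.com", "spend": 80.0},
    {"cust_id": "C1", "region": "EMEA", "email": "a@example.com", "spend": 30.0},
]


@dataclass(frozen=True)
class DistillationSpec:
    """Hypothetical governance metadata describing one business view."""
    name: str
    allowed_fields: tuple  # fields this audience is permitted to see
    master_key: str        # MDM-style key used to consolidate records
    measure: str           # numeric field summed per master key


def distill(records, spec):
    """Drop fields outside the spec and consolidate rows on the master key."""
    view = {}
    for rec in records:
        key = rec[spec.master_key]
        # First time we see a key, keep only the governed, non-measure fields.
        row = view.setdefault(
            key,
            {f: rec[f] for f in spec.allowed_fields if f != spec.measure},
        )
        row[spec.measure] = row.get(spec.measure, 0.0) + rec[spec.measure]
    return list(view.values())


# One specific view for one specific business purpose (assumed: marketing).
marketing = DistillationSpec(
    name="marketing_spend",
    allowed_fields=("cust_id", "region", "spend"),
    master_key="cust_id",
    measure="spend",
)
marketing_view = distill(RAW_LAKE, marketing)
```

The same raw records could be distilled with a different spec for a different audience, which is the point: the lake holds everything, and each view is governed for its own purpose.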
As CMOs, COOs, CISOs and others demand more from information, and the value delivered from information increases, it is not enough for IT to offer a choice between an unmanaged Data Lake and a fully managed EDW/LDW. Instead, IT must be able to adapt to changing requirements and offer a more elastic approach to information. This means delivering the appropriate information, appropriately governed, and enabling flexibility in consumption that best matches the business value being delivered rather than best enabling IT cost management. Information is a value driver for business, and the Business Data Lake provides the flexibility for businesses to deliver that value based on the right approach at the right time.
It’s great that Gartner are highlighting the need for governance and the need to consider both structured and unstructured information in a unified manner. Where I disagree is in seeing this as a choice between a Hadoop Data Lake and an EDW/LDW approach. The reality is that, as Gartner highlight, the two approaches deliver different benefits to different user groups, and it’s this message that I’ve taken further in the years since the Mastering Big Data paper, in the work that evolved into the Business Data Lake.
At Capgemini we’ve been championing governance in a Big Data world and it’s great to have Gartner agreeing with us.
