In this multi-part series we shall look at the use cases for a “Data lake” in a typical Enterprise Application landscape.
Data Lake: the data migration use case
In the implementation journey of any Enterprise Application like ERP, CRM etc., successful data migration is a key milestone. The same holds true for instance consolidation, enterprise application migration, global rollouts, upgrades etc. In most cases, the objects migrated from the legacy applications are the master & reference data objects.
The scope for transaction data migration is usually limited to open inventory & G/L balances, open purchase orders, open sales orders and other open transactions only. What happens to the closed transactions belonging to the current financial year or those belonging to the closed financial periods? In other words, where does the history of an organization reside? In some cases, this legacy data may be backed up to tape but is usually unusable, because the corresponding legacy application has already been decommissioned and the servers switched off, with no practical means left to retrieve the information. Beyond the application transaction history, what about other unstructured data like server logs, interface files, documents, point-of-sale transactions, call detail records etc.?
Does it mean that if an organization changes its Enterprise Applications platform, it loses its connection with the past? Without large amounts of historical data, how can it hope to achieve the following objectives:
· Improve sales & inventory forecast
· Utilize techniques like predictive analytics
· Improve preventive maintenance
Traditional data migration
Because traditional data warehousing tools such as ETL suites are cost prohibitive, many organizations depend on homegrown or SI-provided custom solutions based on MS Office, .NET etc. to perform data migration. Even if they can afford traditional ETL tools like SAP Data Services, the amount of data that can be held in the staging area is limited. This can be due to a lack of resources such as licenses, skilled manpower, processing power or an adequate time window for ETL, or because the ETL tools cannot handle the variety of data.
Hence, the key impediments to achieving data nirvana are:
· Cost of hardware, software & storage
· Capability to handle all data types
· Ability to handle vast amounts of data in an always ready mode
But is there life after death for data?
Enter Big Data technology, along with the data lake. Big Data refers to the large amounts of poly-structured data that flow continuously through and around organizations. The cost of the technologies needed to store and analyze large volumes of diverse data has dropped, thanks to open source software running on industry-standard hardware. The cost has dropped so much, in fact, that the key strategic question is no longer what data is relevant, but rather how to extract the most value from all the available data.
The common use cases of big data technology like Hadoop include:
1. As a flexible data store
2. As a simple database
3. As a data processing engine
4. Together with SAP HANA for data analytics
Big data technology uses tools such as Hive & Pig scripts, Flume and Sqoop for ETL. Some traditional ETL tools have also adapted by adding the capability to “translate” an ETL job into a MapReduce job using JAQL technology: the ETL job is rewritten as a JAQL query, which is then executed as a MapReduce job on Hadoop. SAP Data Services is able to auto-generate Pig scripts to read from and write to HDFS, including joins and push-down operations.
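To make the idea of an ETL job running as a MapReduce job more concrete, here is a minimal sketch of the pattern in plain Python. It is not Hadoop, Pig or JAQL code; the record layout and the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen purely for illustration. The job totals closed sales-order amounts per customer and year.

```python
from collections import defaultdict

# Hypothetical closed sales-order lines extracted from a legacy system.
ORDER_LINES = [
    {"customer": "C001", "year": 2012, "amount": 120.0},
    {"customer": "C002", "year": 2012, "amount": 80.0},
    {"customer": "C001", "year": 2013, "amount": 200.0},
    {"customer": "C001", "year": 2012, "amount": 30.0},
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    yield (record["customer"], record["year"]), record["amount"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


totals = run_job(ORDER_LINES)
```

On a real cluster the map and reduce steps would run in parallel across many nodes and the shuffle would be handled by the framework; the sketch only shows how an aggregation is split into the two phases.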
Next Generation Data Migration
Hence the recommended approach for data migration using a Data Lake would be as follows:
Extract all data from the legacy source systems, including all relevant master and transactional data in its entirety, into a Data Lake. The fundamental technology in the Data Lake that is relevant to this process is HDFS (Hadoop Distributed File System). Data ingestion into the Data Lake from the legacy sources can use the standard Big Data ingestion methods, or traditional ETL tools like SAP BODS can be used to push data into the Data Lake.
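As an illustration of landing extracts unchanged in a raw zone, the sketch below writes one legacy object to a directory partitioned by source system, object and extract date. A local temporary directory stands in for HDFS, and the directory convention, the function names (raw_zone_dir, land_extract) and the sample rows are all assumptions made for this example, not a prescribed layout.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def raw_zone_dir(root, source_system, object_name, extract_date):
    """Build the landing directory for one object from one source system."""
    return (Path(root) / "raw" / source_system / object_name
            / f"extract_date={extract_date.isoformat()}")


def land_extract(root, source_system, object_name, extract_date, rows):
    """Write the rows unchanged as CSV and return the path of the file."""
    if not rows:
        raise ValueError("nothing to land")
    target_dir = raw_zone_dir(root, source_system, object_name, extract_date)
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "part-00000.csv"
    with target_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return target_file


# A local temporary directory stands in for the Data Lake root (assumption).
lake_root = tempfile.mkdtemp(prefix="datalake_")
closed_orders = [
    {"order_id": "4711", "customer": "A-1", "status": "CLOSED"},
    {"order_id": "4712", "customer": "B-7", "status": "CLOSED"},
]
landed = land_extract(lake_root, "legacy_erp", "sales_orders",
                      date(2015, 3, 31), closed_orders)
```

Keeping the source system and extract date in the path means later extracts never overwrite earlier ones, so the raw history stays available after the legacy system is switched off.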
TRANSFORM – cleanse, standardize, and de-duplicate
Profile the legacy source data and generate a data quality assessment score. Perform cleansing iterations, apply data and business validation rules, carry out the legacy-to-target mapping exercise and perform standard data quality cleansing. As the volume of data to be processed is huge, the fundamental technology in the Data Lake that is relevant to this process is MapReduce. The raw data and the cleansed data will reside in the Data Lake in different locations.
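The profiling, standardizing and de-duplicating steps can be sketched on a handful of records as follows. This is a simplified, single-machine illustration, not a MapReduce job or the behaviour of any particular data quality tool. The sample customers, the completeness-based scoring rule and the matching key (name plus tax ID) are assumptions chosen for the example.

```python
import re

# Hypothetical customer master records from two legacy systems.
LEGACY_CUSTOMERS = [
    {"id": "A-1", "name": "  Acme Corp. ", "country": "us", "tax_id": "12-345"},
    {"id": "B-7", "name": "ACME CORP", "country": "US", "tax_id": "12345"},
    {"id": "B-9", "name": "Globex", "country": "", "tax_id": ""},
]

REQUIRED_FIELDS = ("name", "country", "tax_id")


def quality_score(records):
    """Share of required fields that are non-blank, from 0.0 to 1.0."""
    total = len(records) * len(REQUIRED_FIELDS)
    filled = sum(1 for r in records for f in REQUIRED_FIELDS if r[f].strip())
    return filled / total if total else 0.0


def standardize(record):
    """Normalize case, whitespace and punctuation so duplicates can match."""
    name = re.sub(r"[^A-Z0-9 ]", "", record["name"].upper())
    return {
        "id": record["id"],
        "name": " ".join(name.split()),
        "country": record["country"].strip().upper(),
        "tax_id": re.sub(r"\D", "", record["tax_id"]),
    }


def deduplicate(records):
    """Keep the first record per (name, tax_id) and note merged legacy ids."""
    survivors = {}
    for r in records:
        key = (r["name"], r["tax_id"])
        if key not in survivors:
            survivors[key] = dict(r, legacy_ids=[r["id"]])
        else:
            survivors[key]["legacy_ids"].append(r["id"])
    return list(survivors.values())


score = quality_score(LEGACY_CUSTOMERS)
cleansed = deduplicate([standardize(r) for r in LEGACY_CUSTOMERS])
```

Here the two Acme records collapse into one survivor that remembers both legacy IDs, which is the information needed later to map transactions back to the cleansed master data.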
As mentioned in my colleague’s blog “MDM & Big Data”, this process can be aided by an MDM tool like SAP MDG to create a golden record of all master data in the Data Lake.
Finally, the cleansed master data will be pushed back into the Data Lake from the MDM application, to ensure that all present and future consumers of data in HDFS have cleansed master data to better enable their activities. The transactional data objects in HDFS will be migrated into the new Enterprise Application System, after being mapped back to the new master data records, as part of the cutover and go-live activities.
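Mapping transactions back to the new master data records can be pictured as a key lookup against a cross-reference table. The sketch below assumes such a cross-reference (legacy customer key to golden record ID) is available from the MDM step; the table contents, the field names and the function remap_transactions are hypothetical.

```python
# Hypothetical cross-reference from the MDM step: legacy key -> golden record id.
CUSTOMER_XREF = {"A-1": "CUST-1000", "B-7": "CUST-1000", "B-9": "CUST-1001"}

# Hypothetical open sales orders still carrying legacy customer keys.
OPEN_ORDERS = [
    {"order_id": "5001", "customer": "A-1", "amount": 250.0},
    {"order_id": "5002", "customer": "B-7", "amount": 75.0},
    {"order_id": "5003", "customer": "Z-0", "amount": 10.0},
]


def remap_transactions(transactions, xref, key_field="customer"):
    """Replace legacy keys with golden ids; set aside rows with no mapping."""
    mapped, rejected = [], []
    for txn in transactions:
        golden = xref.get(txn[key_field])
        if golden is None:
            rejected.append(txn)  # needs manual review before cutover
        else:
            mapped.append({**txn,
                           key_field: golden,
                           "legacy_" + key_field: txn[key_field]})
    return mapped, rejected


mapped_orders, rejected_orders = remap_transactions(OPEN_ORDERS, CUSTOMER_XREF)
```

Keeping the legacy key alongside the golden ID preserves traceability, and collecting unmapped rows separately gives the migration team a reconciliation list instead of silently dropping data.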
We now have a cleansed set of transactional and master data derived from all the legacy systems including all the relevant history right from the birth of the organization, ready for use in ERP, CRM etc as well as EDW, Analytics and in-memory appliances like SAP HANA.
Big data technology can thus be used to “keep existing data around longer”, facilitating the following:
· Faster decommissioning of legacy systems
· Historical data feeds into the Enterprise Data Warehouse enabling better insights
· Bring down cost of data storage
· Create an online archive: data that was once moved to tape can now be queried to understand long-term trends
· Compliance retention – industry specific requirements for data retention
· Combine with external historical data sources (machine sensor data, weather, survey, research, purchased data etc.) later in SAP HANA for instant analysis
Thus the power of big data technology, in terms of highly scalable storage and processing power, can be used to give archived data a second lease of life.