Effective data management: Data tiering, archiving and purging with SAP BW

In this blog you’ll learn about effective data management: how techniques such as data tiering, archiving, and purging can help you balance your overall data footprint with the needs of your business.

When I started working in the world of SAP data warehousing, the first technique I learnt was how to design for performance. The star schema, dimension IDs and surrogate IDs were fundamental to the way that SAP Business Warehouse worked, providing users with the most optimised experience possible whilst minimising the data footprint.

Moving forward, the notion that ‘disk is cheap’ meant that while the fundamentals remained, the physical constraints around data modelling began to lift. Developers had more freedom to build data flows with limited concern about their data footprint.

This often meant creating de-normalised, multi-layer data models and snapshots, passing the analytical processing effort from query runtime and into the overnight batch. This was great for reporting performance, but it left a legacy of a large and growing data footprint.

When SAP HANA arrived, the game changed completely. The bar was raised on performance to levels not seen before. Architects, developers, and users alike were excited about the possibilities this new technology would bring. However, the increase in performance also brought a new focus: cost of ownership.

Today, the data footprint of every new data model in BW should be at the forefront of the development lifecycle. Effective data management is now essential.

What tools are available?

Architects and business owners often have a requirement to maximise the value of their BW landscape whilst balancing this against the initial investment and operational costs. This requires a considered approach, involving several techniques:

  • Data management – an umbrella term that describes the effective ingestion, storage, and organisation of data across a BI landscape.
  • Data tiering – a multi-layered landscape with data distributed across different physical tiers. Each tier offers differing performance levels and associated cost:
    • Hot tier – data that is the most frequently accessed. This is the most performant tier within the landscape.
      • Usage: daily and weekly sales, current stock balances, operational data
      • Usually stored in memory
    • Warm tier – data that is frequently accessed by the business where performance is less critical. This tier strikes a balance between performance and cost.
      • Usage: monthly sales reporting, monthly orderbook
      • Usually stored in an extension node or offloaded to disk
    • Cold tier – data that is still valuable to the business but is accessed infrequently, where performance is not key.
      • Usage: year-end stock reports, year-end account balances
      • Usually stored in a Near Line Storage (NLS) disk-based system such as SAP IQ
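As an illustration only, the tier characteristics above can be captured in a small lookup table. This is a sketch in Python, not an SAP API; the tier names, storage mappings, and relative costs simply restate the list above.

```python
from dataclasses import dataclass

# Illustrative only: a minimal model of the hot/warm/cold tiers described
# above. The storage media and relative costs mirror the list; none of
# this is an SAP structure or API.

@dataclass
class Tier:
    name: str
    storage: str          # where data in this tier typically lives
    relative_cost: int    # 3 = most expensive, 1 = cheapest

TIERS = {
    "hot":  Tier("hot",  "in-memory (SAP HANA)",            3),
    "warm": Tier("warm", "extension node / disk",           2),
    "cold": Tier("cold", "near-line storage (e.g. SAP IQ)", 1),
}

def storage_for(tier_name: str) -> str:
    """Return the typical storage medium for a given tier."""
    return TIERS[tier_name].storage
```

The point of the mapping is the trade-off it encodes: each step down the tiers trades access performance for a lower cost of ownership.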

  • Archiving – the process of removing data to a separate location, which may or may not remain connected to the original system. Archiving is used where the need for the data is limited and there are no ongoing reporting requirements. It may require additional hardware and software investment, but typically at a lower cost than a cold tier. For file-based archiving (SAP ADK), usually the only investment is a network location to store the resulting files. Regular archiving may form part of normal system administration.

  • Purging – the process of removing data completely from a system; the data is not recoverable. This may be an appropriate option where the data can be re-extracted from the source. It also suits technical data such as change logs and Persistent Staging Areas (PSAs) once the agreed retention period has passed. Purging requires no additional investment beyond administrator hours, and regular purging of technical data should be part of normal system administration.
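The retention-period logic behind purging can be sketched as follows. This is a hedged illustration in Python: `purge_candidates`, the tuple structure, and the request IDs are hypothetical stand-ins for BW request metadata, not real SAP structures or transactions.

```python
from datetime import date, timedelta

def purge_candidates(requests, retention_days, today):
    """Return technical requests (e.g. PSA or change-log entries) whose
    load date falls outside the agreed retention period.

    `requests` is a list of (request_id, load_date) tuples — a made-up
    stand-in for BW request metadata, used purely for illustration.
    """
    cutoff = today - timedelta(days=retention_days)
    return [rid for rid, loaded in requests if loaded < cutoff]
```

For example, with a 90-day retention period, a request loaded a year ago would be returned as a purge candidate, while last month's request would be kept.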

Why is data management important?

The implications of ineffective data management are significant. Growth will continue, and will eventually affect analytic performance and/or require additional investment to maintain the status quo.

There are two main factors to consider: the value of the data to users (including the speed and frequency of access) and the total cost of ownership of the system. The business value of data generally diminishes over time, and with it the requirement for frequent, performant access. Is there a strong business case for keeping data in memory that is no longer used regularly? Data tiering, archiving, and purging offer a scalable solution to this problem.

Balancing Competing Requirements

Effective data management requires architects to balance different priorities. Let’s look at two examples. Firstly, where cost is the priority:

Figure 1 – Priority: Cost [Source: Capgemini]
The hot tier is limited, reserved only for the most frequently accessed, highest-value data where high performance is essential. Users are able to access this data as quickly as possible, enabling timely decisions. The warm tier is also limited, with a greater emphasis on cold storage.

This is a cost-effective option, keeping most of the data on disk in an NLS solution, accessed at run time. Administrators can manage the tiers, organising data to meet needs or to respond to business events, maximising efficiency.

Alternatively, performance can be prioritised.

Figure 2 – Priority: Performance [Source: Capgemini]
A high volume of data is maintained in the hot tier, with greater use of the warm tier and only limited use of the cold tier. This works where the data warehouse serves a wide array of fast-moving business areas that share a requirement for high performance. In extremis, the warm and cold tiers could be removed entirely, creating a single-tier, all-hot system. Archiving and purging play a similar but important role in both models.

Where to start?

Firstly, get your (ware)house in order. Effective data management starts with a healthy system. Create a regular schedule for purging your PSAs (if used) and change logs. Move to Operational Data Provisioning (ODP) for your extractors. Compress your InfoProviders regularly. Archive technical data such as request logs using the SAP Archive Development Kit (ADK). Re-model older data flows to remove legacy physical layers.

Only then should you start to look at the data stored in your InfoProviders. Perform a ‘data temperature assessment’ and classify your data into tiers, typically based on its age. For example, you might classify the most recent two years of data as hot, two to five years as warm, and anything over five years as cold or as a candidate for archiving or purging.
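The age-based classification described above can be expressed as a short function. This is a sketch assuming the two-year and five-year boundaries from the example; it is not a BW transaction or an SAP tool, and your own boundaries should come from your temperature assessment.

```python
from datetime import date

def classify_temperature(record_date: date, today: date) -> str:
    """Classify a record into a data temperature tier by age.

    Boundaries follow the example in the text: under two years is hot,
    two to five years is warm, and anything older is cold (a candidate
    for archiving or purging).
    """
    age_in_years = (today - record_date).days / 365.25
    if age_in_years < 2:
        return "hot"
    if age_in_years < 5:
        return "warm"
    return "cold"
```

For instance, assessed on 1 January 2022, a record dated 1 June 2021 would be classified as hot, one from early 2018 as warm, and one from 2015 as cold.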

Conclusion

Your data warehouse isn’t sitting still. Each day that passes it continues to grow. There will be a point in the future where a choice must be made about how to manage this growth. An effective data management approach will consider data tiering, archiving, and purging, ensuring that you maximise your investment in SAP BW and continue to provide the best analytics experience for your business.

Author


Matt Handy

Senior Solution Architect – Insights and Data

Matt Handy is a Senior Solution Architect in the Insights and Data team at Capgemini UK. He specialises in Business Intelligence and has over 15 years of experience working with SAP products. His industry-specific knowledge includes retailing, consumer products, and utilities. He has worked for a variety of clients all over the world.
