How to use SAP Analytics Cloud’s Smart Predict to discover deep insights and make informed decisions
Training the Smart Predict tool.

Shikha Parihar

July 29, 2024

In the second of our series on Predictive Analytics, lead software engineer Shikha Parihar explores how SAP Analytics Cloud’s Smart Predict functionality uses AI technology to help discover deep insights, simplify access to critical information, and allow informed decision-making.

Last time, we explored how various industries are leveraging predictive analytics to boost their revenues, and how the SAP Analytics Cloud (SAC) is the perfect tool to achieve that. Here, let’s focus on the predictive modelling used in SAC.

So, what is a predictive scenario?

A predictive scenario is a preconfigured workspace used to create predictive models and reports that address a business question which requires the prediction of future events or trends.

There are three types of predictive scenarios:

Classification
Regression
Time Series Forecast

Phases of predictive modelling

1. The learning phase

The model is trained to find trends in historical data so that it can forecast the target in a later period. In this example, the target time frame (April) occurs after the historical time frame (January to March). We’ve set a reference date of 1st April, and the model is trained on three months of historical data to predict what happens in the following month (April).

2. The applying phase

In the applying phase, the model is applied to current data where the outcome is unknown, and it predicts the outcome probability for each client ID. In this example, the model is applied to the latest three months of data (April to June), giving the probability of churn in the following month (July).

We’ve set a reference date of 1st July, and the trained model is applied to three months of current data to predict what will happen in the following month (July).
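The two phases differ only in where the reference date sits. As a minimal sketch, assuming the year 2024 and a three-month history window (the function names `shift_months` and `windows` are hypothetical, not part of SAC), the date ranges can be derived like this:

```python
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Return the first day of the month that is `months` away from d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def windows(reference: date, history_months: int = 3) -> dict:
    """Return half-open (start, end) ranges for the history and target periods.

    The history ends at the reference date; the target is the month that
    starts on the reference date.
    """
    ref = shift_months(reference, 0)
    return {
        "history": (shift_months(ref, -history_months), ref),
        "target": (ref, shift_months(ref, 1)),
    }


# Learning phase: reference date 1st April (the year is an assumption).
learning = windows(date(2024, 4, 1))
# Applying phase: reference date 1st July.
applying = windows(date(2024, 7, 1))

print(learning)
print(applying)
```

With these inputs the learning history runs from January to March with April as the target, and the applying history runs from April to June with July as the target.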

Datasets

For a classification or regression predictive model, the input dataset can be either a training dataset or an application dataset. A time series forecasting model uses a single dataset for both training and testing.

1. Training dataset

The training dataset contains the past observations that we will use to generate the predictive model. By analysing the training dataset, Smart Predict generates a predictive model that explains and predicts the target variable, based on the variables identified as influencers.

2. Application dataset

Once the predictive model is built (trained), it is then applied on an application dataset. This application dataset must contain the same information structure as the corresponding training dataset:

The same number of variables (additional columns will be ignored)
The same variable names as the corresponding training dataset
The same order of presentation of these variables

Once the predictive model is applied, the predicted values of the target are calculated in the output dataset.
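The structure rules above can be checked before a model is applied. The following is a rough sketch of such a check, not SAC’s own validation logic; the function `align_to_training` and the column names are hypothetical:

```python
def align_to_training(training_columns, application_columns):
    """Return the application columns to keep, in training order.

    Raises ValueError if a training column is missing or if the shared
    columns appear in a different order.
    """
    training_set = set(training_columns)
    application_set = set(application_columns)

    missing = [c for c in training_columns if c not in application_set]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Extra columns are dropped, mirroring "additional columns will be ignored".
    kept = [c for c in application_columns if c in training_set]
    if kept != list(training_columns):
        raise ValueError("columns are present but in a different order")
    return kept


# Hypothetical column names for a churn dataset.
TRAINING = ["customer_id", "tenure_months", "monthly_spend", "churned"]
APPLICATION = ["customer_id", "signup_channel", "tenure_months", "monthly_spend", "churned"]

print(align_to_training(TRAINING, APPLICATION))
```

Here the extra `signup_channel` column is ignored, while a missing or reordered column raises an error.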

3. Output dataset

An output dataset contains the result of applying the predictive model to the application dataset, along with any additional information requested.

The output datasets are saved by default in the folder: Main Menu/Browse/Files, but you can choose another directory if required, or save the output in SAP HANA (if you are connecting to SAP HANA).

Building the predictive model

To build the predictive model, we must define the target and influencer roles.

1. Target Variable

A target variable is a variable that needs to be explained or predicted. For example, a company wants to predict the number of complaints a customer support team will receive this week. The target variable is <Number of customer complaints> and it will take <numerical> values.

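IMPORTANT: For a classification target with two values, such as <Yes> and <No>, the application automatically treats the least frequent value as the target category. If <Yes> is the least frequent value, <Yes> becomes the target category of the target variable.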
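To illustrate the least-frequent-value rule, here is a small sketch that assumes a two-valued target column. The function name `target_category` is hypothetical, and it only mirrors the rule described above rather than SAC’s internal implementation:

```python
from collections import Counter


def target_category(values):
    """Return the least frequent of the two distinct target values."""
    counts = Counter(values)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct target values")
    # most_common() sorts by descending count, so the last entry is the rarest.
    # If both values are equally frequent, this sketch returns whichever
    # value was seen second.
    return counts.most_common()[-1][0]


churned = ["No", "No", "Yes", "No", "No", "Yes", "No"]
print(target_category(churned))
```

In this sample, <Yes> appears twice and <No> five times, so <Yes> is picked as the target category.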

2. Influencers

The influencers are variables that describe the data and explain a target variable. An influencer variable corresponds to a column in the dataset. You can exclude influencers from the training data source while building a predictive model (the learning phase).

Excluded influencers are not considered when calculating the predictive model, are not included in its statistics, are not retrieved from the data source, and are not required when the predictive model is applied to an application dataset.

Take particular care to exclude variables that have a direct connection to the target variable, especially those that indirectly contain it. These leakers, or leak variables, are causally related to the target variable, but instead of being a cause of whatever the target represents, they are its result.

Because of their strong correlation with the target variable, leakers frequently produce a model that reports a high predictive power indicator but is inaccurate in practice. To prevent data leakage, exclude any variable created or updated after the target value reference date, because that data won’t be available when you use the model to make new predictions.
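The date-based rule for spotting leakers can be sketched as a simple filter. This assumes you know when each variable was last created or updated; the function `drop_leakers`, the variable names, and the dates are all hypothetical:

```python
from datetime import date


def drop_leakers(last_updated, reference_date):
    """Keep variables last created or updated on or before the reference date.

    `last_updated` maps a variable name to the date it was last set.
    Anything set after the reference date is treated as a potential leaker.
    """
    return [name for name, updated in last_updated.items() if updated <= reference_date]


# Hypothetical churn variables; the cancellation reason is only recorded
# after the customer has churned, so it is a result, not a cause.
last_updated = {
    "tenure_months": date(2024, 3, 31),
    "monthly_spend": date(2024, 3, 31),
    "cancellation_reason": date(2024, 4, 15),
}

influencers = drop_leakers(last_updated, reference_date=date(2024, 4, 1))
print(influencers)
```

A date check like this catches only one kind of leak. Variables that encode the target indirectly still need a manual review.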

During training, Smart Predict selects an optimised number of influencers to include in the predictive model, so the “Limit the number of influencers” toggle is turned off by default.

Hold Out Samples

The hold-out sample is a set of observations withheld from model learning, created automatically when the model is built. The model’s ability to predict future probabilities can be estimated from how well it predicts the data in the hold-out sample.

The data is partitioned into:

A training subset to train the models.
A validation subset (the hold-out sample) to test the model’s performance and choose the best-performing model from a range of candidates.

The process is as follows:

The analytical data set feeds into the partition data where it is split into training and validation subsets.
The training subset is used to train candidate models.
The validation subset is used to evaluate candidate models to choose the best one and evaluate model performance.
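The partition-train-evaluate loop above can be sketched in a few lines. This is an illustration only: Smart Predict partitions the data automatically, so the every-fourth-row split, the two toy candidate models, and the accuracy metric here are assumptions rather than its actual method:

```python
def partition(rows, every=4):
    """Send every `every`-th row to validation and the rest to training."""
    training = [r for i, r in enumerate(rows) if i % every != every - 1]
    validation = [r for i, r in enumerate(rows) if i % every == every - 1]
    return training, validation


def fit_majority(training):
    """Candidate 1: always predict the most common training label."""
    labels = [y for _, y in training]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(training):
    """Candidate 2: predict True when x exceeds the mean training x."""
    mean_x = sum(x for x, _ in training) / len(training)
    return lambda x: x > mean_x


def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def pick_best(candidates, training, validation):
    """Train each candidate on training data and score it on validation data."""
    scores = {name: accuracy(fit(training), validation) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy dataset: one influencer x and a label that is True when x >= 50.
rows = [(x, x >= 50) for x in range(100)]
training, validation = partition(rows)
best, scores = pick_best(
    {"majority": fit_majority, "threshold": fit_threshold}, training, validation
)
print(best, scores)
```

Because the candidates are scored only on rows they never saw during training, the winning score is a fairer estimate of how the model will behave on new data.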

Stay tuned for the next instalment, in which I’ll expand on data preparation for predictive scenarios and explore the concept of data structures for predictive models.

Capgemini and SAP

With four decades of experience with SAP solutions, serving 1,800 clients across the world, we are a leader in SAP certifications, an SAP Global Strategic Services Partner, and an SAP Global Platinum Reseller Partner. We can help you innovate, integrate and transform, so you can continue to grow, quickly adapt to any context, unlock and enhance business value, and stay ahead of your competition.

Get in touch to start the conversation today.

Author

Shikha Parihar joined Capgemini in March 2023 after a career break of 5 years. With a strong focus on visual data analytics and 11 years of experience, she is an SAP BI/BW skilled professional.
