A machine learns to recognize a cow among other animals.

In my last blog, "Historical Data can make a big, hairy mess in Machine Learning," I illustrated the challenges of managing data carefully, using the story of a priest in a temple, with rats and a cat. In this blog I explain the technical aspects of modeling training data and testing data for machine learning systems.

Machine Learning (ML), Artificial Intelligence (AI) and the five senses

At Capgemini, we take a more visceral, human-like approach to Artificial Intelligence. Chris Stancombe, Head of Group Industrialization, coined the term The Five Senses of Artificial Intelligence, conceptualizing the key elements that make up automation as being akin to five distinct human senses: watch (monitor), listen/talk (interact), act (service), think (analyze) and remember (knowledge). Our approach uses AI and ML to integrate all five senses into a single solution. ML therefore plays a major role in our automation journey, which is why this blog focuses on the role of data models in ML.

Under-fit and over-fit data models in ML

Assume that we have a dataset of cows and other animals, birds, fish, etc., and that we are building a machine learning platform to recognize whether a given animal is a cow or not.

Consider the first model, which says that a cow is large and heavy and has four legs. This is a very general way of describing a cow. Mathematically, we can represent this by a simple function, call it f1.
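
A minimal sketch of such a rule in Python might look like this (the 200 kg weight threshold is an illustrative assumption, not a value from the data):

    def f1(animal):
        # Under-fit rule: checks only that the animal is heavy and has four legs.
        return animal["weight_kg"] > 200 and animal["legs"] == 4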

This function is very general: given any cow, it recognizes it correctly, but other animals can also be mistaken for a cow. This is a very simple model, and both the training error and the testing error are high. This is called under-fitting, because these two attributes cannot fully specify the characteristics of a cow.

Now consider a complex description of a cow chosen from the training data. The description says that a cow is 3 feet tall, weighs 235 kg, has four legs, two horns, a tail, and is brown in color. It is possible to keep describing a given cow in ever more detail. Mathematically, we can represent this by a highly specific function, call it f2.
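
A sketch of this over-specific rule, using exactly the attribute values in the description above:

    def f2(animal):
        # Over-fit rule: matches one particular cow's measurements exactly.
        return (animal["height_ft"] == 3
                and animal["weight_kg"] == 235
                and animal["legs"] == 4
                and animal["horns"] == 2
                and animal["color"] == "brown"
                and animal["has_tail"])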

This description is very specific and includes several attributes, some of which may be irrelevant for the model (for example, color). The function exactly matches the given particular cow (the training data), so it fits that cow perfectly and the training error is very low. The trouble is that there are other cows in the dataset with variations: perhaps 3.5 feet tall, 245 kg, or even with no horns. These cows will not be recognized when we use the function. This is called over-fitting, and the testing error will be very high. In the temple priest story in my last blog, the cat was not a relevant attribute; the over-fit model included the cat and hence created the mess.

Right-fit data model for ML

While f1 is too general, with very few attributes and a very broad range of values (under-fitting), f2 is too specific, with too many attributes and exact values (over-fitting). These two functions are two extremes, and we look for the sweet spot in between: the best hypothesis, called the right fit. For example, a function such as the following, call it f0, may be an optimal model.
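
One possible sketch of such a middle-ground rule keeps only the attributes that generalize across cows and allows ranges instead of exact values (the ranges are illustrative assumptions):

    def f0(animal):
        # Right-fit rule: relevant attributes only, with tolerant ranges.
        return (2.5 <= animal["height_ft"] <= 4.5
                and 150 <= animal["weight_kg"] <= 400
                and animal["legs"] == 4
                and animal["has_tail"])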

This optimal function f0 has a better chance of making fewer errors in both training and testing. In other words, the training error and the testing error are both low.

f1 (under-fit)
  • Simple model
  • Too general
  • Recognizes most cows
  • Recognizes other animals as cows
  • Training error high
  • Testing error high

f0 (right-fit)
  • Optimal model
  • Training error low
  • Testing error low

f2 (over-fit)
  • Complex model
  • Too specific
  • Several cows will not be recognized as cows
  • Low chance of recognizing other animals as cows
  • Training error low
  • Testing error high

Figure 1—Data models for ML to recognize a cow among all animals
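
To make the training-error and testing-error comparison in Figure 1 concrete, a small evaluation sketch such as the one below, reusing the f1, f0 and f2 rules sketched earlier with two tiny hypothetical datasets, computes both error rates for each candidate model:

    def error_rate(model, animals):
        # Fraction of animals the model labels incorrectly.
        wrong = sum(1 for a in animals if model(a) != a["is_cow"])
        return wrong / len(animals)

    # Tiny illustrative datasets; a real system would use large historical datasets.
    train_set = [
        {"height_ft": 3.0, "weight_kg": 235, "legs": 4, "horns": 2,
         "color": "brown", "has_tail": True, "is_cow": True},   # the described cow
        {"height_ft": 5.0, "weight_kg": 500, "legs": 4, "horns": 0,
         "color": "black", "has_tail": True, "is_cow": False},  # a horse
        {"height_ft": 1.0, "weight_kg": 5, "legs": 4, "horns": 0,
         "color": "grey", "has_tail": True, "is_cow": False},   # a cat
    ]
    test_set = [
        {"height_ft": 3.5, "weight_kg": 245, "legs": 4, "horns": 0,
         "color": "white", "has_tail": True, "is_cow": True},   # a hornless cow
        {"height_ft": 5.5, "weight_kg": 550, "legs": 4, "horns": 0,
         "color": "brown", "has_tail": True, "is_cow": False},  # another horse
        {"height_ft": 2.0, "weight_kg": 80, "legs": 4, "horns": 2,
         "color": "white", "has_tail": True, "is_cow": False},  # a goat
    ]

    # f1 misclassifies the horses (high training and testing error),
    # f2 misses the hornless cow (low training error, high testing error),
    # f0 gets all of them right (low training and testing error).
    for name, model in (("f1", f1), ("f0", f0), ("f2", f2)):
        print(name,
              "training error:", round(error_rate(model, train_set), 2),
              "testing error:", round(error_rate(model, test_set), 2))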

Intelligent investment guidance

One of the most widely used Machine Learning use cases is "intelligent investment guidance." Historical data on investment decisions is used to build the training data model, and the model is then used to predict whether a new proposed decision will be profitable or not. If we create a data model with very few parameters (under-fit), the platform will recognize almost every proposal as profit-making. However, if we do a deep analysis of every single aspect and build the model from it (over-fit), it will be too conservative, and hardly any proposal will be recognized as profitable. So, we need to balance between the two extremes and identify the right set of parameters to make the model realistic and useful.
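
As a rough illustration of finding that balance, one common approach is to train models of increasing complexity on the historical decisions and compare training and testing accuracy. The sketch below uses synthetic data and a scikit-learn decision tree whose depth stands in for the number of parameters; all features, values and thresholds here are hypothetical:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for historical investment decisions:
    # five numeric features per proposal, label 1 = turned out profitable.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Depth 1 under-fits, unlimited depth tends to over-fit; a moderate depth balances the two.
    for depth in (1, 4, None):
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        print(depth,
              "training accuracy:", round(model.score(X_train, y_train), 2),
              "testing accuracy:", round(model.score(X_test, y_test), 2))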

Cyber security alert

In developing a machine learning based cyber security alert system, we use historical data on security violation incidents to build the data model that recognizes an alert scenario. We must choose the right set of parameters and threshold values to make the system reliable without making it a nuisance. An under-fit model will create a nuisance, alerting too often even in safe scenarios. An over-fit model will be less reliable, failing to recognize real threats because some of their specific parameter values do not match the model.
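
A toy sketch of the same trade-off for alerting rules; every event field and threshold here is a hypothetical example, not taken from a real system:

    def underfit_alert(event):
        # Too general: fires on any failed login, a constant nuisance.
        return event["failed_logins"] > 0

    def overfit_alert(event):
        # Too specific: replays the exact pattern of one past incident and misses new attacks.
        return (event["failed_logins"] == 17
                and event["source_country"] == "XY"
                and event["hour_of_day"] == 3)

    def balanced_alert(event):
        # Right level: tolerant thresholds on the attributes that actually indicate an attack.
        return event["failed_logins"] >= 10 and event["from_new_device"]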

From the above discussion, it is clear that the data scientist's role is vital in designing a machine learning system. The data scientist must understand the business domain and the business problem in the given context. With the growing focus on the role of data scientists, the success of future business lies increasingly in their hands. The right design of the training data model, with adequate data volume, will make the business more intelligent: business reimagined with Artificial Intelligence.