In my previous blog post, Data Mining – What, Why and How – Part 1, I talked about data mining, its modelling types and the business problems we can solve with data mining, giving examples of machine learning techniques that can be applied to each problem type. If you are an aspiring data scientist, I hope this post will help you understand how to use data mining to solve business problems.

The following table lists a number of machine learning techniques per modelling category and problem type.

Now, let’s start with decision trees…

Decision trees

Decision trees can be described as a *divide-and-conquer* approach to expressing the effect of the independent variables that influence a decision (also known as explanatory variables) on a dependent variable (also known as the response variable). They are called decision trees because this approach to learning results in a tree-shaped model describing a set of decisions, as depicted in Figure 1. Decision trees consist of nodes that act as decision blocks, leaves that act as terminal blocks, and branches that lead to nodes or leaves. Each node specifies a condition on an attribute, and each branch coming out of the node corresponds to one of the attribute's values.

Figure 1 – Decision trees illustration

Decision tree problems are mainly categorised into *classification trees* and *regression trees*.

*Classification trees*

In classification trees, the type of the response variable is categorical and is predicted by one or more categorical or continuous explanatory attributes. Classification trees are used to predict the class that an observation belongs to, given a set of measurements.
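As a minimal sketch of the idea, a trained classification tree is essentially a nested set of attribute tests ending in class labels. The attributes, thresholds and classes below are invented purely for illustration:

```python
# A tiny, hand-written classification tree: each internal node tests an
# attribute, each leaf returns a class label. Attribute names and
# thresholds here are made up for illustration only.
def classify(instance):
    if instance["age"] < 30:                 # decision node
        if instance["income"] < 40_000:      # decision node
            return "declined"                # leaf
        return "approved"                    # leaf
    if instance["has_defaulted"]:            # decision node
        return "declined"                    # leaf
    return "approved"                        # leaf

print(classify({"age": 25, "income": 50_000, "has_defaulted": False}))  # approved
```

In a real application the tree structure would be learned from data rather than hand-written, but the prediction step is exactly this chain of tests.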

*Regression trees*

Regression trees aim to predict continuous response variables from a set of continuous and/or categorical explanatory variables. They are based on the idea of the linear regression model for quantitative prediction. Each decision node contains a test on an explanatory variable's value, while the leaves contain a predicted value.
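A minimal sketch of the prediction side, assuming a tree with a single split whose leaves predict the mean response of the training instances falling in their region (all values invented):

```python
# Minimal regression-tree sketch: one decision node testing x < 4.0;
# each leaf predicts the mean response of the training instances in its
# region. The (x, y) data are invented for illustration.
training = [(1.0, 10.0), (2.0, 12.0), (6.0, 30.0), (7.0, 34.0)]

SPLIT = 4.0  # decision node: test "x < 4.0"
left = [y for x, y in training if x < SPLIT]
right = [y for x, y in training if x >= SPLIT]

def predict(x):
    leaf = left if x < SPLIT else right
    return sum(leaf) / len(leaf)  # leaf value = mean of its responses

print(predict(1.5))  # 11.0 (mean of the left leaf)
print(predict(6.5))  # 32.0 (mean of the right leaf)
```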

Rule Induction Learning

Rule induction learning schemes take a *separate-and-conquer* approach to creating classification rules. They consider each class in turn and extract rules that cover all the instances in it. The result of this approach is a set of classification rules rather than a decision tree.

An example is illustrated in Figure 2. Let's assume that we have a dataset consisting of instances of classes c1 and c2 (Figure 2a). Choosing class c1 first, the dataset is separated along its x axis at, say, x = 2.4 (Figure 2b), which forms the first test in the first rule for class c1:

1. If x > 2.4 then class = c1.

However, the above test also covers some of c2's instances, so another separation is needed on the y axis (Figure 2c) at, say, y = 3.5, which adds a second test to the rule:

1. If x > 2.4 and y > 3.5 then class = c1.

The above rule covers all but one of the c1 instances (Figure 2c), so another rule has to be generated:

2. If x > 5 and y > 3 then class = c1.

The same approach is followed for class c2 instances.

Figure 2 – Rule induction learning example
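The finished rule set for c1 from the walkthrough above can be expressed directly as code. For brevity, anything the two rules fail to cover defaults to c2 here; in practice, c2 would get its own rules via the same separate-and-conquer procedure:

```python
# The two rules derived for class c1 in the walkthrough, expressed as
# code; instances covered by neither rule are assigned to c2.
def classify(x, y):
    if x > 2.4 and y > 3.5:   # rule 1
        return "c1"
    if x > 5 and y > 3:       # rule 2
        return "c1"
    return "c2"

print(classify(3.0, 4.0))  # c1 (covered by rule 1)
print(classify(6.0, 3.2))  # c1 (covered by rule 2)
print(classify(1.0, 1.0))  # c2 (covered by neither rule)
```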

Bayesian Analysis

Bayesian analysis is a statistical modelling approach which endeavours to estimate the likelihood of an observed event. Suppose there exists a set of events that are equally possible, only one of which can actually occur, though it is unknown which one. Bayesian analysis, in contrast with classical statistics, assigns a probability to each of these events based on previous knowledge of the particular event. This previous knowledge is known as the prior probability, in that it is available before the observed event occurs. The probability that results from Bayesian analysis is known as the posterior probability.

Bayesian classifiers are able to predict the probability of an instance belonging to a class. Let I be an instance whose class is unknown and H a hypothesis that I belongs to a certain class. The posterior probability of hypothesis H given instance I is calculated by Bayes' theorem, namely:

P(H|I) = P(I|H) P(H) / P(I)

where P(H|I) is the conditional probability of the hypothesis H given the instance I, P(I|H) is the probability of the instance given that the hypothesis H is true, P(H) is the prior probability assigned to the hypothesis H and P(I) is the prior probability assigned to the instance I.

The above method goes by the name naïve Bayes classification because it assumes that the attributes of an instance are all statistically independent of one another and that all attributes are equally important to the outcome.
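A minimal naïve Bayes sketch on an invented categorical dataset, computing the posterior as the prior times the product of the per-attribute likelihoods (no smoothing, for brevity):

```python
from collections import Counter, defaultdict

# Toy training data (invented): each instance is (attributes, class).
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
]

class_counts = Counter(cls for _, cls in data)
# attr_counts[cls][attr][value] = occurrences of value for attr in cls
attr_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, cls in data:
    for attr, value in attrs.items():
        attr_counts[cls][attr][value] += 1

def posterior(attrs, cls):
    # Bayes' theorem under the naive independence assumption:
    # P(H|I) is proportional to P(H) * product of P(attr=value | H).
    p = class_counts[cls] / len(data)  # prior P(H)
    for attr, value in attrs.items():
        p *= attr_counts[cls][attr][value] / class_counts[cls]
    return p

def classify(attrs):
    return max(class_counts, key=lambda cls: posterior(attrs, cls))

print(classify({"outlook": "sunny", "windy": "no"}))  # play
```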

Bayesian Networks

Naïve Bayes can only represent simple distributions, whereas decision trees can represent arbitrary ones. Bayesian networks allow the representation of class probability distributions between subsets of variables in a graphical structure. A Bayesian network is drawn as a directed acyclic graph where each node corresponds to one attribute and contains a probability table for each of the attribute's values. A sample representation of a Bayesian network is given in Figure 3.

Figure 3 – Bayesian network representation
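As a minimal sketch, here is a two-node network (Rain → WetGrass) with invented probability tables, showing how the factorised joint distribution supports a Bayes' theorem query:

```python
# A minimal two-node Bayesian network, Rain -> WetGrass, with invented
# probability tables. The joint probability factorises along the graph:
# P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain).
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {  # P(WetGrass | Rain)
    True: {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}

def joint(rain, wet):
    return p_rain[rain] * p_wet_given_rain[rain][wet]

def p_rain_given_wet():
    # Posterior of Rain given wet grass, via Bayes' theorem.
    evidence = joint(True, True) + joint(False, True)  # P(WetGrass=True)
    return joint(True, True) / evidence

print(round(p_rain_given_wet(), 3))  # 0.692
```

Real networks have more nodes and larger tables, but inference still reduces to sums and products over this kind of factorised joint.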

Artificial Neural Networks

Artificial neural networks can be considered a computing system consisting of multiple interconnected simple processors that digest information received from external inputs and produce a dynamic response. Artificial neural networks work similarly to the human brain when solving a problem, creating neural paths from previous knowledge that are used as patterns for future predictions.

Inspired by biological nervous systems, an artificial neural network is composed of nodes, also known as neurons, which can act as inputs, outputs or intermediate processors. Each node is connected to a set of neighbouring nodes by a series of weighted paths, and all work together to solve a specific problem.

At first, an observation is analysed and weighted according to past experience. Subsequent observations are filtered through the neurons in order to make a prediction. The prediction error is assessed, the model modifies the weights to improve the prediction, and it then moves on to the next observation. This cycle repeats for each observation in what is termed the training phase, when the model is being calibrated.

Artificial neural networks, like humans, learn by example and are used for detecting very complex relationships between inputs and outputs.

Figure 4 – An artificial neural network illustration
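The predict–assess–adjust cycle described above can be sketched with a single logistic neuron; the task (learning logical AND) and the learning rate are invented for illustration:

```python
import math

# A single artificial neuron (logistic unit) trained by repeatedly
# predicting, assessing the error and nudging the weights -- the cycle
# described above. The task (logical AND) is invented for illustration.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, bias = 0.0, 0.0, 0.0
rate = 0.5

def predict(x1, x2):
    return 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + bias)))

for _ in range(2000):            # training phase
    for (x1, x2), target in data:
        out = predict(x1, x2)
        error = target - out     # assess the prediction error
        w1 += rate * error * x1  # modify the weights ...
        w2 += rate * error * x2
        bias += rate * error     # ... then move to the next observation

print([round(predict(x1, x2)) for (x1, x2), _ in data])  # [0, 0, 0, 1]
```

Real networks chain many such neurons in layers and propagate the error backwards through them, but each unit's update follows this same pattern.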

Clustering Analysis

Clustering analysis is a technique for creating meaningful groups by dividing a set of observations based on their similarities. Clusters can give insight into the observations' dimensionality, identify outliers and reveal relationships between data objects, along with conclusions about the domain from which the data is drawn. In clustering analysis, no assumption is made regarding the number of clusters or their features. It is common practice, after clustering analysis is performed, to infer a decision tree or a set of classification rules that allocates each instance to the appropriate cluster.

Clustering results can be illustrated by laying out the instances in a two-dimensional space and splitting the space accordingly, where each partition corresponds to a cluster, as shown in Figure 5a. In cases where one instance can belong to more than one cluster, Venn diagrams are employed, as shown in Figure 5b.

Figure 5 – Clustering results representation

*Heuristic Distance-Based Clustering*

Heuristic distance-based clustering results are usually expressed either as exclusive groups, where each observation belongs to one and only one cluster, or as overlapping groups, where an observation might belong to more than one cluster. The classic distance-based clustering technique is k-means clustering, which divides a set of instances into k disjoint clusters.
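A minimal k-means sketch on invented one-dimensional data, alternating between assigning points to their nearest centroid and recomputing centroids as cluster means until the centroids stabilise:

```python
# Minimal k-means (k = 2) on invented one-dimensional data: assign each
# point to its nearest centroid, recompute centroids as cluster means,
# repeat until the centroids stop moving.
points = [1.0, 1.5, 0.5, 8.0, 8.5, 7.5]
centroids = [points[0], points[3]]  # deterministic initialisation

while True:
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # converged
        break
    centroids = new_centroids

print(centroids)  # [1.0, 8.0]
```

Real implementations work in many dimensions, try several random initialisations and guard against empty clusters, but the assign-then-update loop is the whole algorithm.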

*Hierarchical Clustering*

Hierarchical clustering algorithms produce tree-shaped structures where the root denotes the instance space, which is divided into a few clusters at the next level; these in turn are divided into sub-clusters, and so on.
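A minimal agglomerative (bottom-up) sketch on invented one-dimensional data, repeatedly merging the two closest clusters under single linkage until two clusters remain; the recorded merge order is what forms the tree structure:

```python
# Minimal agglomerative hierarchical clustering on invented data:
# start with each point as its own cluster and repeatedly merge the
# two closest clusters (single linkage) until two remain.
points = [1.0, 1.2, 4.0, 9.0, 9.3]
clusters = [[p] for p in points]
merges = []  # the merge order records the tree (dendrogram) structure

def distance(a, b):
    # Single linkage: distance between the closest pair of members.
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 2:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
    )
    merges.append((clusters[i], clusters[j]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(clusters)  # [[1.0, 1.2, 4.0], [9.0, 9.3]]
```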

*Probability-Based Clustering*

Clustering results can also be expressed as probabilistic associations produced by probability-based clustering algorithms that assign to each training instance a probability of membership in each of the clusters.

Nearest-Neighbour Learning

Nearest-neighbour learning, in contrast to the other learning schemes described in the previous sections, does not apply any modelling techniques. Instead, the training instances themselves are memorised and represent the knowledge. To classify a new unlabelled test instance, the scheme searches the training instance space for the 'most alike' instance and assigns its class to the unlabelled test instance.

The nearest-neighbour learning scheme is the simplest form of learning, also called instance-based learning, as no work is performed until the time comes to make a prediction. An example of nearest-neighbour learning is illustrated in Figure 6. The blue and orange points belong to two different classes. When a new instance (the black point) is to be classified, its distance to each of its neighbours is measured. The nearest neighbour (an orange point) is chosen and its class is assigned to the new instance.

Figure 6 – Nearest-neighbour learning example
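A minimal 1-nearest-neighbour sketch mirroring the Figure 6 example; the points and class labels are invented:

```python
import math

# Nearest-neighbour learning: the "model" is just the memorised
# training set. Points and classes are invented for illustration.
training = [((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"),
            ((5.0, 5.0), "orange"), ((5.5, 4.5), "orange")]

def classify(point):
    # Find the 'most alike' training instance by Euclidean distance
    # and hand over its class to the new instance.
    _, cls = min(training, key=lambda item: math.dist(point, item[0]))
    return cls

print(classify((4.8, 4.9)))  # orange
```

A common refinement is k-nearest neighbours, which takes a majority vote over the k closest training instances instead of trusting a single one.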

For an in-depth explanation of the above techniques and more, I recommend the book Data Mining: Practical Machine Learning Tools and Techniques.

For those who want to learn about our current data mining projects and the events our team is involved in, read on:

- Student applications open for the international Data Science Game at Capgemini’s Les Fontaines
- Matt Thompson talks about how to take Machine Learning to the next level by combining multiple analytical techniques together
- Natalia Angarita discusses how Machine Learning can be applied to public services
- Kannan Jayaraman gives his views on how Analytics can drive optimisation in public services
- Toby Gamm talks about Assurance Scoring
- Tom Sinadinos discusses about Network Analysis at Scale