Many of us, in the first steps of our data science careers, come across terms like classification, decision trees, segmentation, naive Bayes and so many more that sometimes we get confused as to what each of them means and what it has to do with data mining.
The purpose of this blog series is to explain what data mining covers, what business problems we can solve with it, what sort of methodologies to employ, and other aspects that govern data mining projects, such as how to evaluate data mining work, how to manage a data mining project and more. It is mainly addressed to our junior members and those who aspire to become data scientists, with the aim of helping them get a better understanding of the tools and techniques they are using.
Before I start explaining all these terms, let’s start with…what is data mining? Data mining can be defined as the automatic or semi-automatic process of discovering meaningful structural patterns in large quantities of data by applying machine learning techniques.
Data mining technology can be broken down into two broad modelling categories: supervised learning and unsupervised learning. Unsupervised learning methods detect patterns in data when no specific output is provided, with the aim of establishing those outputs. In supervised learning, the aim is to classify data into given, predefined outputs.
Very often, people working in data analysis (myself included) face the question ‘which machine learning technique should I use?’ Choosing the appropriate technique is essential for a successful project outcome. However, in my opinion the first question we should ask before considering techniques is ‘what is the problem I am trying to solve?’. Understanding the type of business problem can help us narrow down our options and ease the technique selection process. Also, the more we know about the problem, the better choice we will make, since each machine learning technique has its own characteristics. Data mining addresses a number of problem types, including data description and summarisation, segmentation, concept description, classification, prediction and dependency analysis.
Now, let’s look at the above problem types more closely.
Data description and summarisation aims at describing the data in a succinct and intelligible form (as opposed to simply reading all of it). With simple statistical techniques the user can gain insight into the data’s structure. Typically, data description and summarisation are part of a larger data mining goal, ordinarily applied at an early stage when the nature of the data is still unknown and the analysis goals are not precisely defined. For example, describing and summarising a product campaign by consumer age and gender could suggest which consumer groups should be the focus of further marketing strategies.
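To make the campaign example concrete, here is a minimal summarisation sketch using only the Python standard library. The dataset of (age, gender, purchased) records is invented purely for illustration:

```python
# A minimal data-description sketch over a hypothetical campaign
# dataset of (age, gender, purchased) records.
from statistics import mean
from collections import Counter

campaign = [
    (23, "F", True), (35, "M", False), (41, "F", True),
    (29, "F", True), (52, "M", False), (47, "M", True),
]

# Simple statistics describing the age distribution
ages = [age for age, _, _ in campaign]
print("age: min", min(ages), "mean", round(mean(ages), 1), "max", max(ages))

# Conversion rate by gender -- a simple summary that could hint at
# which consumer groups to target with further marketing.
totals, buyers = Counter(), Counter()
for _, gender, purchased in campaign:
    totals[gender] += 1
    buyers[gender] += purchased

for g in sorted(totals):
    print(g, "conversion:", round(buyers[g] / totals[g], 2))
```

Even a summary this small already answers a business question (which group converts better) without anyone reading the raw rows.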
Segmentation can be thought of as the process of splitting the data into conceptually meaningful groups where all members of each group share a set of characteristics not found in other groups. Segmentation can sometimes be a data mining objective but very often is a step towards solving other problem types, such as grouping a large dataset to reduce the size and simplify the analysis. A segmentation problem might be the grouping of consumer data according to their income, needs and preferences. Machine learning techniques for segmentation include clustering and neural networks.
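As a sketch of the income-grouping example, here is a tiny one-dimensional k-means clustering loop in plain Python. The income figures and the choice of two segments are illustrative assumptions:

```python
# A minimal 1-D k-means sketch segmenting customers by income.
# Data and k = 2 are invented for illustration.
incomes = [21_000, 24_000, 27_000, 80_000, 86_000, 95_000]
centroids = [incomes[0], incomes[-1]]  # naive initialisation

for _ in range(10):  # a few refinement passes suffice here
    clusters = [[], []]
    for x in incomes:
        # assign each customer to the nearest centroid
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(clusters)   # [[21000, 24000, 27000], [80000, 86000, 95000]]
print(centroids)  # [24000.0, 87000.0]
```

The two resulting groups could then feed a simpler downstream analysis, exactly as the paragraph above describes.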
Concept description can give a thorough understanding of concepts and groups, such as those created by segmentation. For example, a company might want to learn more about the customers of its high-value products and the customers of its low-value products, in order to infer what could be done to turn the latter into the former, or to keep customers within the high-value bracket. By applying concept description methods, users can generate rules describing each customer category. Suitable machine learning techniques for concept description include rule induction learning and conceptual clustering.
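As a very rough sketch of what a describing rule looks like, the snippet below derives one interval rule per customer category from attribute ranges. The data and attribute names are invented, and this is not a real rule-induction algorithm such as CN2 or RIPPER, just an illustration of the output shape:

```python
# A rough concept-description sketch: one range-based rule per
# customer category. Data and attributes are illustrative assumptions.
customers = [
    {"spend": 1200, "visits": 30, "category": "high value"},
    {"spend": 950,  "visits": 22, "category": "high value"},
    {"spend": 150,  "visits": 4,  "category": "low value"},
    {"spend": 90,   "visits": 2,  "category": "low value"},
]

rules = {}
for cat in {c["category"] for c in customers}:
    group = [c for c in customers if c["category"] == cat]
    # describe the category by the min/max range of each attribute
    rules[cat] = {
        attr: (min(c[attr] for c in group), max(c[attr] for c in group))
        for attr in ("spend", "visits")
    }

for cat, rule in sorted(rules.items()):
    conds = " AND ".join(f"{lo} <= {a} <= {hi}" for a, (lo, hi) in rule.items())
    print(f"IF {conds} THEN category = {cat}")
```

A real rule learner would also prune and generalise these conditions, but the human-readable IF/THEN form is the point of concept description.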
Classification deals with the creation of models (classifiers) used to predict the class or the probability of the class of previously unlabeled data. The classes are categorical and predefined and could be generated by segmentation methods. Classification is considered to be one of the most important data mining problem types employed by a broad range of applications. A classification task might involve trying to determine if an insurance claim represents fraudulent behaviour. A classifier can be generated by creating two classes, one for legitimate and one for fraudulent claims. The classifier can then be used to label new claims to one of these two classes or give a probability of a claim belonging to each class. Some of the appropriate machine learning techniques include decision trees, naïve Bayes, neural networks, discriminant analysis, rule induction learning, case-based reasoning, genetic algorithms and nearest-neighbour.
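To ground the insurance example, here is a minimal nearest-neighbour classifier, one of the techniques listed above. The features (claim amount, days since policy start) and the labelled training claims are invented for illustration:

```python
# A minimal nearest-neighbour classification sketch for the
# insurance-claim example. Features and labels are invented.
import math

train = [
    ((500, 400),  "legitimate"),
    ((700, 350),  "legitimate"),
    ((9000, 10),  "fraudulent"),
    ((8500, 5),   "fraudulent"),
]

def classify(claim):
    # label the new claim with the class of its nearest training claim
    nearest = min(train, key=lambda pair: math.dist(pair[0], claim))
    return nearest[1]

print(classify((8800, 7)))   # fraudulent
print(classify((600, 380)))  # legitimate
```

A production classifier would of course need many more features and labelled examples, but the core idea, assigning unlabeled data to predefined classes, is the same.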
Prediction is similar to classification: it determines the value of a previously unseen target variable. The difference between classification and prediction is that classification deals with discrete (categorical) data values while prediction works with continuous (numerical) data values. Predictive models can be used for anything from forecasting the turnover of shares to predicting earthquakes. Machine learning techniques for prediction problem types include regression analysis, regression trees, neural networks, genetic algorithms and nearest-neighbour.
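The simplest of those techniques, regression analysis, fits in a few lines. Below is a least-squares fit of y = a·x + b on an invented series (think advertising spend versus turnover), then a prediction for a previously unseen input:

```python
# A minimal simple-linear-regression sketch; the series is invented
# (e.g. advertising spend vs turnover).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# closed-form slope and intercept for least-squares regression
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

print(round(a, 2), round(b, 2))  # 1.98 0.06
print(round(a * 6 + b, 2))       # predicted value for x = 6: 11.94
```

Note the contrast with the classification sketch: the output here is a continuous number, not a class label.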
Dependency analysis is employed on problems of finding interesting associations among data objects in order to understand their relationships. A typical example is identifying associations in shopping baskets (basket analysis), which the business can then use to deploy specific advertisement strategies. Machine learning techniques for dependency analysis include decision trees, regression analysis, association rules and Bayesian networks.
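The basket-analysis example can be sketched with the two standard association-rule measures, support and confidence, computed over a handful of invented baskets (the items and the 0.5 thresholds are illustrative assumptions):

```python
# A minimal basket-analysis sketch: item-pair association rules with
# support and confidence. Baskets and thresholds are invented.
from itertools import permutations

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "crisps"},
]

items = set().union(*baskets)
n = len(baskets)

rules = []
for a, b in permutations(sorted(items), 2):
    support_a = sum(a in bk for bk in baskets)        # baskets containing a
    support_ab = sum(a in bk and b in bk for bk in baskets)  # containing both
    if support_a == 0 or support_ab == 0:
        continue
    confidence = support_ab / support_a  # P(b in basket | a in basket)
    if support_ab / n >= 0.5 and confidence >= 0.5:
        rules.append((a, b, support_ab / n, confidence))

for a, b, sup, conf in rules:
    print(f"{a} -> {b}: support {sup:.2f}, confidence {conf:.2f}")
```

A rule such as "butter -> bread" with high confidence is exactly the kind of relationship a retailer could exploit when placing products or targeting adverts.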
Now that we have a better understanding of the business problems we can solve with data mining, we shall look in more detail at the machine learning techniques available for each one…so stay tuned for Data Mining – What, Why and How – Part 2.