Data Mining – What, Why and How – Part 1

Many of us, in the first steps of our data science careers, come across terms such as classification, decision trees, segmentation and naive Bayes, and so many more that it is easy to become confused about what each one means and how it relates to data mining.

The purpose of this blog series is to explain what data mining covers, which business problems it can solve, which methodologies to employ, and other aspects that govern data mining projects, such as how to evaluate data mining work and how to manage a data mining project. It is mainly addressed to our junior members and to those who aspire to become data scientists, with the aim of helping them better understand the tools and techniques they are using.

Before I start explaining all these terms, let’s start with… what is data mining? Data mining can be defined as the automatic or semi-automatic process of discovering meaningful structural patterns in large quantities of data by applying machine learning techniques.

Data mining techniques can be broken down into two broad modeling categories: supervised learning and unsupervised learning. Unsupervised learning methods are applied when no specific output is provided; they detect patterns in the data, and those patterns (for example, groups) establish the outputs. In supervised learning, the outputs are already given for a set of examples, and the aim is to learn how to assign or predict those outputs for new data.
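To make the distinction concrete, here is a minimal sketch in plain Python. The data, the labels and the function names (`nearest_neighbour_label`, `two_means`) are all made up for illustration; a real project would more likely rely on an established machine learning library. The supervised part is given a label for every training example, while the unsupervised part receives only the raw values and has to find the groups itself.

```python
from statistics import mean

# Toy data (invented for illustration): customer annual spend in thousands.
spend = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]

# Supervised: an output (label) is provided for every training example.
labels = ["low", "low", "low", "high", "high", "high"]


def nearest_neighbour_label(x, xs, ys):
    """Return the label of the training value closest to x (1-nearest-neighbour)."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[closest]


# Unsupervised: no labels are provided; a tiny two-cluster k-means finds groups.
def two_means(xs, iterations=10):
    """Split 1-D values into two clusters; return (centres, assignments)."""
    centres = [min(xs), max(xs)]  # simple deterministic starting centres
    assignments = []
    for _ in range(iterations):
        # Assign each value to the nearer of the two centres.
        assignments = [
            0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1 for x in xs
        ]
        # Move each centre to the mean of the values assigned to it.
        for k in (0, 1):
            members = [x for x, a in zip(xs, assignments) if a == k]
            if members:
                centres[k] = mean(members)
    return centres, assignments


predicted = nearest_neighbour_label(7.5, spend, labels)  # uses the given labels
centres, groups = two_means(spend)                       # discovers the groups
print(predicted, centres, groups)
```

Note that the unsupervised result is only a set of group numbers; deciding what each group means (for instance "low spenders" and "high spenders") is still up to the analyst.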

Very often, people working in data analysis (myself included) face the question ‘which machine learning technique should I use?’ Choosing the appropriate technique is essential for a successful project outcome. However, in my opinion, the first question we should ask before considering techniques is ‘what is the problem I am trying to solve?’ Understanding the type of business problem helps us narrow down our options and eases the technique selection process. Also, the more we know about the problem, the better the choice we will make, since each machine learning technique has its own characteristics. Data mining addresses a number of problem types, including data description and summarisation, segmentation, concept description, classification, prediction and dependency analysis.

Now, let’s look at the above problem types more closely.

  • Data description and summarisation aims to describe the data in a succinct and intelligible form (as opposed to simply reading all your data). Using simple statistical techniques, the user can gain insight into the data’s structure. Typically, data description and summarisation forms part of a data mining goal and is applied at an early stage, when the nature of the data is still unknown and the analysis goals are not precisely defined. For example, summarising the results of a product campaign by consumer age and gender could suggest which consumer groups should be the focus of further marketing strategies.
     
  • Segmentation can be thought of as the process of splitting the data into conceptually meaningful groups, where all members of each group share a set of characteristics not found in other groups. Segmentation can sometimes be a data mining objective in its own right, but very often it is a step towards solving other problem types, such as grouping a large dataset to reduce its size and simplify the analysis. A segmentation problem might be the grouping of consumers according to their income, needs and preferences. Machine learning techniques for segmentation include clustering and neural networks.
     
  • Concept description aims to give a thorough understanding of concepts or groups, which may have been created by segmentation. For example, a company may want to learn more about the customers of its high-value products and the customers of its low-value products, in order to infer what could be done to turn the latter into the former or to keep existing customers within the high-value bracket. By applying concept description methods, users can generate rules describing each customer category. Suitable machine learning techniques for concept description include rule induction and conceptual clustering.
     
  • Classification deals with the creation of models (classifiers) used to predict the class, or the probability of each class, for previously unlabeled data. The classes are categorical and predefined, and could be generated by segmentation methods. Classification is considered one of the most important data mining problem types and is employed in a broad range of applications. A classification task might involve trying to determine whether an insurance claim represents fraudulent behaviour. A classifier can be built around two classes, one for legitimate and one for fraudulent claims. The classifier can then be used to assign new claims to one of these two classes, or to give the probability of a claim belonging to each class. Some of the appropriate machine learning techniques include decision trees, naïve Bayes, neural networks, discriminant analysis, rule induction, case-based reasoning, genetic algorithms and nearest-neighbour.
     
  • Prediction is similar to classification in that it determines the value of a previously unseen target variable. The difference is that classification deals with discrete (categorical) values, while prediction works with continuous (numerical) values. Predictive models can be used for tasks ranging from predicting the turnover of shares to forecasting earthquakes. Machine learning techniques for prediction problems include regression analysis, regression trees, neural networks, genetic algorithms and nearest-neighbour.
     
  • Dependency analysis is used to find interesting associations amongst data objects in order to understand their relationships. A typical example is identifying associations between items in shopping baskets (basket analysis), which the business can then use to deploy specific advertising strategies. Machine learning techniques for dependency analysis include decision trees, regression analysis, association rules and Bayesian networks.
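
As an illustration of the last problem type, basket analysis can be sketched in a few lines of plain Python. The baskets below are invented, and the helper names `support` and `confidence` are my own; they compute the two standard measures used for association rules: how often two items appear together, and how often the second item appears given that the first one does. This is only a sketch for pairs of items, not a full association rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Toy shopping baskets (invented for illustration).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Count how many baskets contain each item and each (alphabetically sorted) pair.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(a, b):
    """Of the baskets that contain a, the fraction that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]


print(support("bread", "butter"))     # bread and butter together: 3 of 5 baskets
print(confidence("bread", "butter"))  # butter in 3 of the 4 baskets with bread
```

A rule such as "customers who buy bread also buy butter" is usually considered interesting only when both measures are high enough; the thresholds are a business decision rather than something the data dictates.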

Now that we have a better understanding of the business problems we can solve with data mining, we shall look in more detail at the machine learning techniques available for each one…so stay tuned for Data Mining – What, Why and How – Part 2.

 
