How data quality can hurt your data science programme… if you’re not careful

Data quality isn’t often thought of as the most exciting aspect of data science, but that doesn’t take away from its critical importance.

With growing investment around Artificial Intelligence and machine learning, and near-daily success stories in the news, organisations in traditional industries are investing in data science like never before.

However, many of these companies must deal with legacy systems, a lack of core data skills and poor data quality.

The consequences of poor data quality can be enormous. According to research published in MIT Sloan Management Review, companies lose an estimated 15% to 25% of their revenue because of poor data quality.

Poor data quality has even been cited as a factor in disasters including the explosion of the space shuttle Challenger and the shooting down of an Iranian Airbus by the USS Vincennes.

One consequence of poor data quality is that knowledge workers waste up to 50% of their time dealing with mundane data quality issues. For data scientists, this number may go as high as 80%.

Data science initiatives will likely not produce their expected results if there aren’t strong data foundations in place.

A Kaggle survey in 2017 of professionals in the data science domain showed that dirty data was their number one challenge (see Figure 1).

Figure 1 - Kaggle 2017 ML & DS Survey - What barriers are faced at work?

Although there are several ways data scientists can tackle data quality issues, data quality should be an organisation-wide priority.

Beyond the importance of quality data in training machine learning models, data quality affects pretty much every process or decision that relies on an organisation’s data.

What do we mean by data quality?

Data quality is a broad term but can be considered across six key dimensions:

Completeness – are all data sets and data items recorded?

Consistency – can we match the data set across data sources?

Uniqueness – is there a single view of unique data attributes?

Validity – does the data match defined rules?

Accuracy – does the data reflect the real value?

Timeliness – is the data available when required after it was entered or gathered?
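Some of these dimensions can be measured directly. The sketch below is a minimal illustration of three of them (completeness, uniqueness and validity) using only the Python standard library. The records, field names and the email rule are hypothetical and chosen purely for illustration; a real check would use the organisation's own rules.

```python
import re

# Hypothetical customer records; the field names are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "bob@example"},
    {"id": 4, "email": "cat@example.com"},
]

# Assumed validity rule: a very loose "something@something.something" pattern.
EMAIL_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows in which the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among the populated values."""
    values = [r[field] for r in rows if r[field] is not None]
    return len(set(values)) / len(values) if values else 1.0


def validity(rows, field, rule):
    """Share of populated values that match the given rule."""
    values = [r[field] for r in rows if r[field] is not None]
    if not values:
        return 1.0
    return sum(bool(rule.match(v)) for v in values) / len(values)


email_completeness = completeness(records, "email")    # 3 of 4 populated
id_uniqueness = uniqueness(records, "id")              # id 2 appears twice
email_validity = validity(records, "email", EMAIL_RULE)  # "bob@example" fails
```

Scores like these are only useful when tracked over time and compared against a threshold agreed with the business.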

How good is good enough?

In any company, it is certain that data will never be 100% perfect. There will always be inconsistencies through human error, machine error or through sheer complexity due to the growing volume of data companies now handle.

So that leads to the question, how good is good enough for the purposes of data science?

This depends on the business requirements for model accuracy. Terms such as precision and recall need to be explained to business stakeholders so that they can judge whether the data and model quality are good enough for their use case.
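As a small worked example of those two terms, the sketch below computes precision and recall for a binary classifier from lists of true and predicted labels. The labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged in error
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only: 1 = positive case, 0 = negative case.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.75 recall=0.75"
```

In plain terms, precision answers "of the cases the model flagged, how many were right?" and recall answers "of the cases that should have been flagged, how many did the model find?". Which one matters more depends on the cost of each kind of mistake for the use case.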

Fortunately, there are also many techniques that can be used when developing machine learning models to address data quality issues, including:

  • Imputation of missing values
  • Outlier detection
  • Data standardization and deduplication
  • Handling of different data quantities
  • Analytical transformation of input variables
  • Selection of variables for predictive modelling
  • Assessment of model quality
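Three of the techniques in the list above (imputation, outlier detection, and standardisation with deduplication) can be sketched with the Python standard library. This is a minimal illustration under simple assumptions: missing values are represented as None, the median is used for imputation, outliers are flagged with the common 1.5 × IQR rule, and names are standardised by lower-casing and collapsing whitespace. The data and function names are hypothetical, and real projects often need more careful choices.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing values (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def standardise(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def deduplicate(names):
    """Keep the first occurrence of each standardised name, in order."""
    seen, unique = set(), []
    for name in names:
        key = standardise(name)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Illustrative data: two missing ages and one implausible value.
ages = [34, None, 29, 41, None, 38, 250]
filled = impute_median(ages)
outliers = iqr_outliers(filled)

suppliers = deduplicate(["Acme Ltd", " acme  ltd", "Beta plc"])
```

The median is used here rather than the mean because it is less affected by extreme values such as the 250 in the example.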

However, these approaches can only take you so far, and it is the whole organisation’s responsibility to improve the quality of its data.

How can data quality issues be addressed by an organisation?

Data quality culture

Establishing a strong culture of data quality is paramount and must be initiated at the top of the organisation. There are nine steps for organisations that wish to improve data quality:

  1. Declare a high-level commitment to a data quality culture
  2. Drive process reengineering at the executive level
  3. Spend money to improve the data entry environment
  4. Spend money to improve application integration
  5. Spend money to change how processes work
  6. Promote end-to-end team awareness
  7. Promote interdepartmental cooperation
  8. Publicly celebrate data quality excellence
  9. Continuously measure and improve data quality

Data quality software

Many software solutions exist to manage and improve data quality, covering a range of critical functions such as profiling, parsing, standardization, cleansing, matching, enrichment and monitoring. A 2019 Gartner report (see Figure 2) evaluates 15 vendors of data quality tools.

Figure 2 – Gartner 2019 Magic Quadrant for Data Quality Tools

Some of these vendors, such as Informatica with CLAIRE, have recently incorporated machine learning and artificial intelligence into their product offering.

How can machine learning and Artificial Intelligence help improve data quality?

Machine learning and artificial intelligence rely on good quality data, but a growing number of early adopters are also turning to them to automate processes for cleaning data. Some of the applications of ML and AI to data quality are:

Named Entity Recognition – Named Entity Recognition (NER) is a Natural Language Processing technique that automates the retrieval of important entities, such as persons, organizations and locations, from unstructured data.

Record Linkage (Matching) – Probabilistic Record Linkage has been used for many years in a variety of industries to match individuals, locations or objects across different source systems. While this method can produce useful results, it is now possible to improve matching accuracy by using Machine Learning and Neural Network algorithms.
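To make the matching problem concrete, the sketch below links records from two hypothetical source systems by name similarity. It is deliberately simpler than the methods named above: it uses plain string similarity from Python's standard difflib module with an assumed threshold of 0.85, not probabilistic record linkage or a trained model. The record fields, identifiers and threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.85):
    """For each left record, link the most similar right record if it clears the threshold."""
    links = []
    for l_rec in left:
        best = max(right, key=lambda r_rec: similarity(l_rec["name"], r_rec["name"]))
        score = similarity(l_rec["name"], best["name"])
        if score >= threshold:
            links.append((l_rec["id"], best["id"], round(score, 2)))
    return links


# Hypothetical records from two source systems.
crm = [
    {"id": "C1", "name": "Jonathan Smith"},
    {"id": "C2", "name": "Maria Garcia"},
]
billing = [
    {"id": "B7", "name": "Jonathon Smith"},
    {"id": "B9", "name": "Peter Jones"},
]

links = link_records(crm, billing)
linked_ids = [(left_id, right_id) for left_id, right_id, _ in links]
```

Here the misspelt "Jonathon Smith" is linked to "Jonathan Smith", while "Maria Garcia" has no sufficiently close counterpart and stays unlinked. A production approach would normally compare several fields (for example name, address and date of birth) and weight them, which is where probabilistic and machine learning methods come in.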

Text Classification – Another Natural Language Processing technique, Text Classification can be used to automate the classification of unstructured text. For example, it can tag products into new categories based on their product descriptions or ingredients.
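As an illustration of the product tagging example, the sketch below trains a very small multinomial Naive Bayes classifier on product descriptions. Naive Bayes is just one simple choice of algorithm, and the categories and descriptions are invented for the example; a real system would need far more training data and a properly evaluated model.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior probability of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: product description -> category.
training = [
    ("oat milk drink with added calcium", "dairy alternatives"),
    ("soya milk unsweetened", "dairy alternatives"),
    ("dark chocolate bar with cocoa butter", "confectionery"),
    ("milk chocolate bar with hazelnuts", "confectionery"),
]

model = train(training)
tag_a = classify("almond milk drink", model)
tag_b = classify("chocolate bar", model)
```

With this toy data, "almond milk drink" is tagged as "dairy alternatives" and "chocolate bar" as "confectionery", even though "almond" never appears in the training descriptions.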

The future of data quality in machine learning and Artificial Intelligence

As we move through the new decade, more companies will be deploying machine learning models to make business critical decisions. We will likely see some high-profile incidents where decisions have been made by models trained on poor quality data, leading to regulatory fines and expensive rectification processes.

Organisations that have done the hard work of introducing a culture of data quality and have started to automate the management of their data using ML and AI will see the benefits and have a far higher success rate in running data science initiatives.


Jon Howells, Lead Data Scientist, Insights & Data UK

Jon is a Lead Data Scientist within Insights & Data UK. He has over five years’ experience, leading data science projects across FMCG, Public Sector & Financial Services. Prior to a career in data science he studied Computational Statistics and Machine Learning (MSc) at University College London. His passion lies in helping clients unlock the tangible value from their data by creating effective data products and strategies using data science and machine learning.
