Testing strategy for a deployed machine learning system


Machine learning system testing has long been considered boring and unimportant. But as more and more data science projects move from Proof of Concept (PoC) to production, effective testing has become increasingly relevant to the data science world.

Applying a testing strategy throughout machine learning system development

Machine learning system development usually consists of three phases: the experiment phase, the development phase and the production phase.

The experiment phase is the core of machine learning system development because the data science process is research centric: throughout this phase, data scientists test different algorithms and model configurations until they reach a satisfactory result. In the development phase, the model from the experiment phase is deployed for production usage; software engineers implement unit, integration and differential tests to ensure the performance of the deployed model is as good as that of the models in the experiment phase. In the production phase, the system makes predictions against live data, and the focus of testing shifts to monitoring the performance and accuracy of the system. In this blog I will only cover the testing strategy for the development phase.

Testing strategy for a deployed machine learning system

To distinguish machine learning system testing from traditional software system testing, I will use the following two pyramids to demonstrate the differences. The pyramid on the left is the test pyramid that most traditional software systems apply; it contains only three types of tests, namely unit tests, integration tests and user interface (UI) tests. The test pyramid on the right is the one I have applied on my current data science project, based on a testing strategy from the software engineering community. The idea is that instead of testing a machine learning system as a whole, we should create a separate pyramid for each of its artifacts, that is, data, model and code, as shown in the second pyramid. Once each artifact has passed its tests, we should then think about how to test combinations of them, as shown in the graph. Please also keep in mind that this pyramid is just an example: as you add more tests to your system, you should rethink the shape of the pyramid and check whether the tests are still feasible to integrate into the system.

Test cases that you should consider including in your deployed machine learning system

Let us now take a closer look at some test cases that are essential for the majority of machine learning systems:

Unit tests

A machine learning pipeline usually consists of multiple components, and each component in turn contains multiple functions. Unit tests are implemented to verify the functionality of those predefined functions, and they should be in place once the components are complete.

Validate input and pre-processed data: An enterprise may use features from different sources, and multiple people or teams may be involved in the data processing activity. Writing unit tests for the input data and the pre-processed data ensures that the input features always fulfil certain criteria and that no bias is introduced by the raw input data.
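
As a minimal sketch of such a check, the unit test below validates a pre-processed feature table against a few agreed criteria; the column names and value ranges are hypothetical and would come from your own feature contract:

```python
import pandas as pd

def validate_input_data(df: pd.DataFrame) -> bool:
    """Check that a pre-processed feature table fulfils the agreed criteria."""
    # Required feature columns must be present (hypothetical feature names)
    required = {"age", "income", "gender"}
    if not required.issubset(df.columns):
        return False
    # No missing values are allowed after pre-processing
    if df[list(required)].isnull().any().any():
        return False
    # Value-range checks guard against corrupt upstream data
    if not df["age"].between(0, 120).all():
        return False
    return True
```

A test like this runs cheaply on every pipeline execution, so an upstream schema change is caught before it silently degrades the model.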

Validate model config: This is one of the most important tests, yet it is usually neglected by data scientists. We usually keep key settings in a config file, which means that even a small change in the configuration may have a big impact on the entire system. A classic example of a model config test: after running many experiments, the data scientists have learnt which model settings are optimal and which settings result in bad model performance; we should then test that the configuration in the system satisfies all of those previously validated assumptions.
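
One way to encode those learnt assumptions is a validator that returns every violation it finds; the parameter names and bounds below are purely illustrative placeholders for constraints established in your own experiments:

```python
def validate_model_config(config: dict) -> list:
    """Return a list of violations; an empty list means the config passes."""
    errors = []
    # Bounds below are hypothetical, established during experimentation
    lr = config.get("learning_rate")
    if lr is None or not (0 < lr <= 0.1):
        errors.append("learning_rate must be in (0, 0.1]")
    depth = config.get("max_depth")
    if depth is None or not (2 <= depth <= 10):
        errors.append("max_depth must be between 2 and 10")
    if config.get("objective") not in {"binary:logistic", "reg:squarederror"}:
        errors.append("objective is not one of the supported values")
    return errors
```

Returning all violations at once, rather than failing on the first, gives the team a complete picture when a config file drifts.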

Validate model quality: Model quality testing verifies the quality of the model before it affects real predictions in production. This can mean testing the model's performance metric against a fixed threshold or against the result from the previous run.
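
A minimal sketch of such a gate, assuming accuracy as the metric (any metric would work the same way) and a hypothetical threshold:

```python
def check_model_quality(y_true, y_pred, threshold=0.85, previous_score=None):
    """Fail the pipeline if the candidate model underperforms.

    Raises AssertionError when accuracy is below the fixed threshold,
    or below the score recorded for the previous run (if provided).
    """
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    assert accuracy >= threshold, (
        f"accuracy {accuracy:.3f} is below threshold {threshold}")
    if previous_score is not None:
        assert accuracy >= previous_score, (
            f"accuracy {accuracy:.3f} regressed vs previous run {previous_score:.3f}")
    return accuracy
```

Comparing against the previous run catches gradual regressions that a fixed threshold alone would let through.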

Validate model bias: Sometimes the overall test performance is very good, and we also get good metrics on the validation set. It is then important to check how the model performs against baselines for specific data slices, to validate that no bias was introduced by the training data.
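
As an illustrative sketch, the check below computes per-slice accuracy and flags slices that fall more than a tolerance below the baseline; the slice column, baseline and tolerance are all assumptions you would set for your own data:

```python
import pandas as pd

def check_slice_bias(df: pd.DataFrame, slice_col: str,
                     baseline: float, tolerance: float = 0.05) -> list:
    """Return the slices whose accuracy falls more than `tolerance`
    below `baseline`; df must carry 'y_true' and 'y_pred' columns."""
    correct = df["y_true"] == df["y_pred"]
    per_slice = correct.groupby(df[slice_col]).mean()
    return per_slice[per_slice < baseline - tolerance].index.tolist()
```

An empty list means every slice performs within tolerance; a non-empty list names the slices to investigate for training-data bias.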

Integration tests

Integration testing verifies that separately developed components produce the expected result when combined. It is a term we usually use in the context of machine learning pipelines: in a pipeline, changes in one component can cause errors in other components, so it is important to include integration tests to ensure the components work together seamlessly. In my current project, we test all intermediate data sets generated between pipeline steps; for example, when the training data set is created by the data preparation step, we want to make sure it fulfils certain criteria and won't cause the following steps to fail. A second test that I have found useful is to check that the expected outputs have been saved and registered in the right places.
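
The two checks described above can be sketched as follows; the column names, criteria and artifact names are hypothetical stand-ins for whatever your pipeline steps actually produce:

```python
from pathlib import Path

import pandas as pd

def check_training_set(df: pd.DataFrame) -> None:
    """Guard between the data-preparation step and the training step."""
    assert len(df) > 0, "training set is empty"
    assert "label" in df.columns, "label column is missing"
    assert df["label"].nunique() >= 2, "training set contains a single class"
    assert not df.isnull().values.any(), "training set contains missing values"

def check_artifacts_saved(output_dir: str, expected: list) -> list:
    """Return the names of expected artifacts that were not written."""
    return [name for name in expected
            if not (Path(output_dir) / name).exists()]
```

Running these between steps turns a vague downstream failure into an immediate, well-named assertion at the boundary where the problem originated.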

Differential test

Differential testing is also known as back-to-back testing. A differential test compares outputs from the version under test with outputs from a released version. Since we are dealing with machine learning, a new version usually means a change to the model algorithm or to the configuration of an existing model. The test is done by feeding the same input dataset into the test version and the released version and checking whether the differences between the versions are acceptable; in this way we can ensure the next released version is reliable.
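
A minimal sketch of such a comparison, assuming both versions expose a callable that returns a numeric prediction; the tolerance and the acceptable disagreement rate are thresholds you would choose per project:

```python
def differential_test(released_model, candidate_model, inputs,
                      tolerance: float = 0.02) -> float:
    """Feed the same inputs to both versions and return the fraction of
    inputs on which their predictions differ by more than `tolerance`.
    The caller decides what disagreement rate is acceptable."""
    disagreements = sum(
        1 for x in inputs
        if abs(released_model(x) - candidate_model(x)) > tolerance)
    return disagreements / len(inputs)
```

Reporting a disagreement rate rather than a pass/fail verdict lets the team review borderline releases instead of blocking them outright.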


As machine learning technologies continue to evolve and grow more complex, having an effective testing strategy is essential. By integrating effective testing into our projects, we can better manage the risks and uncertainties introduced by the data and the machine learning models.



Wanmeng He

Senior data scientist in the AI&A team.
