
Model Governance and a Structured Way of Working (WoW) on Data Science Platforms

Capgemini
2022-03-07

Many organizations are currently transitioning from traditional on-premises data warehouses towards cloud-based data hubs, data layers, data lakes, data meshes, or data lakehouses. All of these "data products" must have, as an architecture principle, a well-implemented WoW for reusing data and for governing the repeatable improvement of machine learning models. In essence, this means adding the technical capability of continuous data freshness to improve machine learning business use cases. This capability is even more complex in organizations that build data products belonging to one or many data domains. Data is a living entity that changes every nanosecond, and as new data is ingested, the learning process of machine learning models also needs to be monitored for drift. Implementing governance for the data science WoW therefore helps organizations minimize cloud costs and maximize the revenue of the final data product. This is even more important in large organizations with many data science teams that need to reuse many data features and to grow in line with business goals and the services the business provides.

How to structure and implement model governance on data science platforms?

On data science platforms, it is important to have cloud DevOps implemented as a pillar of the continuous integration and continuous delivery (CI/CD) architecture design. It is equally important to evolve DevOps into MLOps in order to drift and deploy machine learning pipelines, which are the core of bringing machine learning models into production. Machine learning deployments are highly coupled with data, which is why deploying models manually quickly becomes cumbersome, particularly when many teams are involved. I strongly recommend against manual deployments; instead, apply DevOps principles, because you need to deploy not only code but also data as a living entity, and here model governance plays a vital role in the process.
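To make "deploying code and data together" concrete, below is a minimal sketch using the open-source MLflow library, which the article does not mention and which serves purely as an illustration; the experiment name, model name, and data-version tag are hypothetical.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A local SQLite backend enables the model registry for this sketch.
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("churn-model")  # hypothetical experiment name

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Version the data reference alongside the code and the model,
    # so a deployment is reproducible end to end.
    mlflow.log_param("training_data_version", "2022-03-01")  # hypothetical tag
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")

Registering the model version this way is what lets an automated pipeline, rather than a person, promote it to production.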

MLOps is DevOps with a focus on data scientists and machine learning engineers. It can also be defined as a machine learning engineering culture and a collection of best practices that aim to unify and standardize machine learning system development (Dev) and machine learning system operations (Ops). This engineering practice needs, uses, and reuses data, supported by a proper feature store mechanism that assures repeatability in terms of features. A feature store mechanism is part of model governance and model drift management on data science platforms.

The feature store is a technical capability in machine learning data pipelines. Why is it important? Because the feature store is the engine and repository that makes data features discoverable and reusable: the features created in the feature engineering process during the initial exploratory data analysis. The feature store applies the data freshness principle to help find a machine learning model that reaches the required level of accuracy or classification quality, always in the context of the data problem or business use case that the data scientists and machine learning engineers are working on.
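As a thought experiment, the sketch below captures the core contract just described: register a feature group once, then let any team discover and reuse it. It is a toy, in-memory stand-in for a real governed backend, and all names in it are hypothetical.

import pandas as pd

class MiniFeatureStore:
    # A toy, in-memory feature store: register once, discover and reuse everywhere.

    def __init__(self):
        self._groups = {}  # group name -> DataFrame of features

    def register(self, name, features):
        # A real feature store would also version the data, validate the
        # schema, and record lineage metadata here.
        self._groups[name] = features

    def list_groups(self):
        # Discoverability: other teams can browse what already exists.
        return sorted(self._groups)

    def get(self, name, columns=None):
        # Reuse: serve the same features to training and to inference.
        df = self._groups[name]
        return df[columns] if columns else df

store = MiniFeatureStore()
store.register("customer_profile", pd.DataFrame(
    {"customer_id": [1, 2], "avg_order_value": [42.0, 17.5], "orders_90d": [3, 8]}
))
print(store.list_groups())                              # ['customer_profile']
print(store.get("customer_profile", ["customer_id", "orders_90d"]))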

But what is a feature store?

First, we need to know what a data feature is in a machine learning context. A data feature is an attribute that describes a data point. For example, a pixel is stored as a collection of 1's and 0's, but a data feature describes, for instance, the colour of that pixel. A feature store is therefore the bridge between your model and the data used by that model. Data becomes features through a sub-process called feature engineering, which is mostly done inside an IPython notebook as part of the exploratory data analysis carried out by the Data Scientist role or persona. I would also point out that the feature store is an essential piece of technology for operationalizing data science models and pipelines. In other words, a feature store stores data features and serves data to build machine learning models; a small feature engineering example follows below.
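To make feature engineering concrete, here is a small illustrative example of raw event data being turned into model-ready features, the kind of transformation typically done in a notebook during exploratory data analysis. The column names are invented for the example.

import pandas as pd

# Hypothetical raw events: one row per purchase.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 7.0, 9.0],
    "timestamp": pd.to_datetime([
        "2022-01-03", "2022-02-14", "2022-01-20", "2022-02-02", "2022-03-01",
    ]),
})

# Feature engineering: aggregate raw data points into per-customer features.
features = raw.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_orders=("amount", "count"),
    last_seen=("timestamp", "max"),
).reset_index()

print(features)  # these rows are what would be written to the feature store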

Here below is an overview of some important characteristics of the feature store technical capability:

[Image: key characteristics of the feature store technical capability]

Why is a feature store needed as part of the WoW for machine learning models?

This is a core part of machine learning governance. In the process of building machine learning models, the data is transformed into features, and the models that use those features need to be trained and re-trained until an acceptable level of accuracy is reached for deployment to production. This is an iterative, repetitive process, with fresh data being ingested every time. Data freshness can be achieved in batches, micro-batches, or "online", which in essence is a continuous sequence of micro-batch time windows. Because of this iterative process, a feature store is needed to serve data to the machine learning modelling process. This process also governs model drift in production and yields many Key Performance Indicators (KPIs) for data science and MLOps governance; a simple drift check is sketched below.
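One way to govern model drift as fresh data arrives is a statistical check that compares the distribution of a feature at training time with the same feature in production. The sketch below uses a two-sample Kolmogorov-Smirnov test; the threshold and the synthetic data are illustrative assumptions, not a universal rule.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # feature at training time
live_feature = rng.normal(loc=0.4, scale=1.0, size=5000)   # same feature, fresh data

statistic, p_value = ks_2samp(train_feature, live_feature)

# Illustrative policy: flag the model for re-training when the two
# distributions differ significantly (threshold chosen for the example).
if p_value < 0.01:
    print(f"Drift detected (KS={statistic:.3f}); trigger the re-training pipeline")
else:
    print("No significant drift; keep the deployed model")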

It is also worth mentioning that another good reason to work with a feature store is the value it brings to the data and to machine learning model drift in terms of tracking and traceability. It is highly recommended when there are many teams of data scientists and machine learning engineers running many batch or online machine learning pipelines.

Cloud service providers such as Amazon (AWS), Google (GCP), and Microsoft (Azure) all offer a feature store engine as a cloud service. For example, AWS embeds a feature store in its SageMaker service, GCP offers a feature store engine in Vertex AI, and on Azure a feature store is mostly implemented with Azure Databricks or Azure Machine Learning. There are also vendors, such as continual.ai, that have implemented feature store functionality as a SQL feature store, which is an interesting option.
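To give a flavour of the managed services, here is a hedged sketch of the workflow against the SageMaker Feature Store Python SDK. The bucket, role ARN, and group name are placeholders, and running it requires valid AWS credentials and permissions; treat it as an outline rather than a copy-paste recipe.

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# The same per-customer features as before, plus the mandatory event time.
df = pd.DataFrame({
    "customer_id": [1, 2],
    "avg_order_value": [42.0, 17.5],
    "event_time": [1646601600.0, 1646601600.0],  # unix epoch seconds
})

fg = FeatureGroup(name="customer-profile", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature types from the frame
fg.create(
    s3_uri="s3://my-bucket/feature-store",              # hypothetical bucket
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/MyRole",   # hypothetical role
    enable_online_store=True,                           # serve features online too
)
fg.ingest(data_frame=df, max_workers=1, wait=True)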

On the other hand, I strongly believe the open-source stack for building machine learning models is worth knowing and starting with. TensorFlow has many functionalities for handling data as vectors, which is exactly how feature store technology works. For instance, Google has developed TensorFlow Extended (TFX), which is used on the GCP Vertex AI platform for managing the feature store and machine learning pipelines. An important characteristic to mention is that TFX can run independently of GCP Vertex AI, and hence independently of any cloud platform.
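For that open-source route, a minimal TFX pipeline can run entirely on a laptop with the LocalDagRunner, and the same definition can later be handed to Vertex AI. The pipeline name, data folder, and paths below are placeholders for this sketch.

from tfx import v1 as tfx

# Ingest CSV files and compute dataset statistics and a schema;
# "data" is a hypothetical folder containing the raw CSV data.
example_gen = tfx.components.CsvExampleGen(input_base="data")
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])

metadata_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
    "metadata.db"  # local ML Metadata store for lineage and tracking
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="feature-pipeline",   # hypothetical name
    pipeline_root="pipeline-root",      # local directory for pipeline artifacts
    components=[example_gen, statistics_gen, schema_gen],
    metadata_connection_config=metadata_config,
)

# The LocalDagRunner executes the same pipeline definition that a
# managed orchestrator such as Vertex AI Pipelines would run.
tfx.orchestration.LocalDagRunner().run(pipeline)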

In summary, can you imagine having an MLOps Control Center as shown below? One that gives your company automated model health monitoring, machine learning alerting, feature store usage, and an end-to-end model lineage and model catalog as part of your Data Science Way of Working.

Image taken from AI Glass Box Capgemini v9.2

I'm thrilled to hear your thoughts on how to implement data science governance and how to scale it using MLOps and data science platforms on the cloud. Feel free to reach out to me; we can jointly implement data science governance and MLOps for your data science teams.

About Author


Luis "Lucho" Alberto Farje
Senior Data Solutions Architect

Lucho has worked in different business sectors, in recent years mainly in the automotive industry as a principal data solutions architect, advising customers on assessing business use cases to be implemented as data products, including data science. This includes evaluating architectural technical capabilities and the feasibility of their technical implementation. He has worked for many years under the SAFe framework, in cross-functional teams, supporting the agile release train architectural roadmap for data foundations. He is a cloud-agnostic architect and is passionate about building data foundations including DevOps, DataOps, and MLOps.

Contact details: Luis-alberto.farje@capgemini.com
LinkedIn: https://www.linkedin.com/in/luchofarje/
Twitter: https://twitter.com/lucholuke