Data Science Manager at Capgemini Invent
Data-based technologies are everywhere
In 2021, there is no need to remind anyone that “data is the new oil”, “data is not oil – it’s nuclear power”, or that “big data is at the foundation of all the megatrends that are happening”. Big data, data science, AI, advanced analytics – this new wave of automation has become part of our everyday life and is deeply embedded in our world: from voice assistants that understand basic commands, through self-driving cars that allow safer rides and smart recommendations from online shops and streaming apps, to banking systems that monitor suspicious activity online and smart home devices that make our lives easier and more comfortable. Even though these advancements are all quite recent, they are already in our lives, and it’s hard to live without them.
Machine Learning is becoming a commodity
With that in mind, it’s important to understand that AI is here to stay and that most companies are already moving beyond pilots, deploying more and more AI solutions in production. As the Capgemini Research Institute report “The AI-powered enterprise” showed, in just three years, from 2017 to 2020, the percentage of organizations already using AI solutions rose from 36% to more than half (53%). To succeed with the productization of AI, companies must embrace modern data science practices and technologies. MLOps, defined by the Capgemini Research Institute as “a set of practices to shorten the time to update and go live of analytics and self-learning systems”, is one of the core concepts of modern data science practice. We’ll show how it impacts the data science landscape today. But to keep things in order, let us start with the traditional data scientist’s role and project lifecycle.
Data scientists used to be very research focused
Back in the day, the data scientist’s role was research-focused. An ideal candidate was a PhD in Mathematics or Computer Science with a solid academic track record and a high Hirsch index demonstrating good publishing skills. They would know a scripting language dedicated to statistical programming, such as Statistica, SAS, Matlab, or R. Depending on when they did their PhD, they would also know some basics of FORTRAN, C, or Python. They would have varying levels of business expertise, ranging from domain experts to experts who were mostly interested in conducting research to further boost their academic careers. For the latter, being in business was intended to help their academic track – not the other way around.
|Figure 1: Three pillars constituting data science identity: while mathematics and statistics used to be the most important pillar of the data scientist’s identity, with the advancement of MLOps capabilities, computer science has gained in importance. Business expertise was, and remains, the third, complementary piece of the data science profile.|
Standard projects from a couple of years ago would consist of scoping, data acquisition, modelling, and results-sharing phases. In other words, a team of data scientists would agree with their business stakeholders on what problems they would like to solve, and then they would reach out for data, prepare and preprocess it, run various models on it, gather insights, pack them into slides, and share those results with stakeholders. In the case of more complex data structures, they would be assisted by data engineers, who specialize in extracting, transforming, and loading data from various data sources.
Before getting final approval from the business stakeholders, the team would have to iterate this process a few times to adjust the questions, data, or methods they used. In the end, they would get a sign-off, and the final results would be presented to decision makers as slides with gathered insights, or as a blueprint for a potential solution (which would then be handed over to and implemented by a software engineering team).
|Figure 2: Standard data science project lifecycle from pre-MLOps era.|
MLOps allows for faster and easier productization
However, it became apparent that such an operating model was not sufficient and could be sped up. Keeping data scientists focused mainly on research and ad hoc analyses had detrimental effects on the productization of their solutions. As the field became more mature and popular, better tools for MLOps were created, and these solved the problem of slow productization of machine learning solutions.
Therefore, while there is still high demand for research-focused data science experts, a more common expectation is that a data scientist will be able to invent a solution to a problem and cooperate with, or work within, a software development team to implement it in production.
And “it” – the solution – is defined as the whole automated pipeline, including automated data acquisition, model retraining (with automated model validation), and deployment, together with a set of rules defining when the system should rerun the whole cycle. This is one of the most critical aspects to understand when discussing MLOps: the final product is no longer a single model – it is the whole pipeline. While it takes some additional effort to build at first, the fact that it is highly automated allows for better scaling and, consequently, less work later on.
|Figure 3: Modern data science project lifecycle. After a project is scoped, the data science team builds a pipeline in which each step is automated as much as possible. If the business would like to change the scope, the data science team simply adjusts the pipeline.|
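To make the idea concrete, such a pipeline can be sketched as a chain of plain functions – acquire, train, validate, deploy – where a failed validation stops the deployment. This is a minimal, hypothetical illustration; all function names and the toy least-squares “model” are ours, not part of any specific MLOps framework.

```python
# Minimal sketch of an automated ML pipeline: each stage is a function,
# and run_pipeline() chains them. Names are illustrative only.

def acquire_data():
    # In practice: pull fresh data from a warehouse or an API.
    # Here: a toy dataset following y = 2x + 1.
    return [(x, 2 * x + 1) for x in range(10)]

def train_model(data):
    # Toy "training": fit slope and intercept by least squares.
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def validate_model(model, data, max_error=0.1):
    # Automated validation gate: reject the model if it predicts poorly.
    slope, intercept = model
    worst = max(abs((slope * x + intercept) - y) for x, y in data)
    return worst <= max_error

def deploy_model(model):
    # In practice: push the model to a serving endpoint or registry.
    return {"status": "deployed", "model": model}

def run_pipeline():
    # The whole cycle; a scheduler or data-drift trigger would call this.
    data = acquire_data()
    model = train_model(data)
    if not validate_model(model, data):
        raise RuntimeError("validation failed; keeping previous model")
    return deploy_model(model)

result = run_pipeline()
```

In a real setting, each of these functions would be replaced by a production-grade component, but the shape – automated stages plus a rerun rule – is exactly what distinguishes a pipeline from a one-off model.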
The rising demand for more scalable machine learning solutions is especially visible when comparing the popularity of programming languages over time. While the popularity of R started to rise around 2010, it has stagnated at around 1-2% since 2016, with minor fluctuations. Python, by contrast, became increasingly popular over the last three years, rising from approximately 3% to 12%. Since Python is a general-purpose programming language rather than one used solely for statistical purposes, it is easier to create stable, scalable solutions in Python than in R.
Typically, to speed things up, machine learning solutions are containerized for sharing and deployment via Docker (released in 2013) and orchestrated with Kubernetes (2014), then deployed in a cloud environment such as Amazon Web Services (first infrastructure service for public usage launched in 2004), Google Cloud Platform (2011), or Microsoft Azure (2010).
It is worth noting that most of these solutions are less than 10 years old – and most of them are constantly updated and reshaped (e.g. the machine learning services in most cloud platforms are being heavily developed as we speak, offering more and more functionality, such as feature stores for managing and reusing features, better code and data version control, and others). Therefore, MLOps experts need to constantly refresh their knowledge, and MLOps itself is a constantly updated set of practices.
If you’re interested in reading more about exemplary MLOps usage, we will soon share a post describing such a use case in detail. For now, let us conclude with an important question.
Do YOU need MLOps?
After all, if you are looking for data science experts, you might already be tired of the small supply of qualified candidates, and adding further requirements is not necessarily helpful in recruiting. On the other hand, if you are a data scientist yourself, you might already be exhausted by the continuous stream of new knowledge, tools, and practices to learn.
The single, basic question you might want to ask yourself is: “What are the projects I would like to run?”
If the answer is “ad hoc analyses and short research projects”, then, at least for now, you do not need to worry about MLOps. However, if you are planning to create machine learning models that automatically retrain themselves and work as part of bigger systems – or, bolder still, you would like to reshape your company by weaving AI into its key systems – then you need to take MLOps very seriously.