The machine learning lifecycle
“The Gartner Data Science Team Survey of January 2018 found that over 60% of models developed with the intention of operationalizing them were never actually operationalized.”
Based on our experience working with various clients, we believe that this inability to operationalize is partly due to machine learning projects being seen as inherently mathematical problems. When a project is approached, from ideation to operationalization, as a collaboration between business, data science and engineering, expectations are more realistic from the start and the probability of success increases significantly.
Enter: the machine learning lifecycle.
This initial post will summarize the different phases of the machine learning lifecycle, while in later blog posts we will deep-dive into each phase. The target audience for these blog posts consists of data scientists, data engineers and technical managers.
Machine learning is used for solving problems. Therefore, all machine learning projects should start with the recognition of a problem and the idea that machine learning could potentially be used to solve it. What follows is a collaboration between domain experts and machine learning experts to assess the viability of a machine learning approach to this problem. The following prerequisites are essential for successful ideation:
- Clear requirements regarding business objectives and scope
- Availability of historical data
- Understanding of end-to-end IT infrastructure requirements
Acquiring these prerequisites can pose a challenge because of the different skill sets involved. Having someone on board who understands the business, the data science, and the required engineering efforts well enough can make a big difference. However, many projects cannot afford this luxury. In a later blog post we will discuss how this process can be structured.
After all preconditions are met, a high-level plan can be made of who needs to work on which tasks. The data team involved will find the required historical data. Data scientists will use this data for initial model development. Engineers or data analysts prepare the downstream systems that will use the predictions – whether that is a dashboard or an automated decision-making system – and IT operations prepares the necessary infrastructure. Last but certainly not least, the business needs to adapt its processes to incorporate the insights or automation provided by the data science solution.
Once key metrics or KPIs that correspond to the business objectives are agreed upon and historical data is acquired, the data scientist can start developing the initial model. Data scientists have a wide array of tools available for the tasks involved:
- Transforming data to a more useful format
- Analysis of data to guide modeling approach
- Writing of the actual machine learning model code
- Creating numbers and visuals for initial reports towards stakeholders
Data scientists often spend a long time in this phase, iterating over many versions of the model. Even though earlier versions of the model might already create value, the transition between development and production costs so much time and effort that it is generally considered more efficient to build a better model first. By decreasing the friction of this transition using tooling and autonomy, value can be generated much faster.
When the development phase is over, the developed model needs to be put in production to start generating value. The complexity of getting a model in production depends on the context of the problem, the autonomy of data science teams and the overall maturity of the organization.
The context of the problem consists of a number of factors:
- Data flow at prediction time
- Sensitivity of the data
- Maximum acceptable latency of delivery
The autonomy of data science teams is determined by the number of required steps that can be taken by the team itself. These steps can include creating service accounts, getting or giving access to the production data systems and changing code in downstream software components. Less autonomy means stricter control over the full system, but also increased time in synchronization meetings and additional dependencies before a model goes live.
Maturity of the organization in this context comprises automation of processes, the amount of knowledge about the intricacies of machine learning projects and the speed at which an organization can process change.
Once a model is deployed, there are a number of measures that can be taken to improve the robustness and quality of the machine learning model. These measures can be roughly divided into four areas: lineage, monitoring, testing and model drift. We call this post-production process maintenance.
The lineage of a machine learning model refers to the origins of the model. It includes which source code the model uses, which data it was trained on and which parameters were used. Having the full lineage available means that when a problem occurs, it is easier to audit what caused it. Because machine learning models generate data when making predictions, this lineage can be added to the lineage of the data itself, which is important for certain compliance requirements.
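To make this concrete, here is a minimal sketch of how lineage could be recorded and attached to predictions. The `ModelLineage` class, its field names and the `tag_prediction` helper are hypothetical names chosen for illustration; a real setup would likely store this in a model registry or metadata store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_of_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ModelLineage:
    """Hypothetical lineage record: code, data and parameters of one model."""
    code_version: str        # assumed to be something like a git commit hash
    training_data_hash: str  # digest of the training data snapshot
    parameters: dict         # hyperparameters used for training

    def fingerprint(self) -> str:
        """Stable identifier derived from all lineage fields."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return sha256_of_bytes(canonical.encode("utf-8"))


def tag_prediction(prediction: float, lineage: ModelLineage) -> dict:
    """Attach the model fingerprint so a prediction can be traced back."""
    return {"prediction": prediction, "model_fingerprint": lineage.fingerprint()}


lineage = ModelLineage(
    code_version="abc123",  # placeholder value for illustration
    training_data_hash=sha256_of_bytes(b"feature,label\n1.0,0\n2.0,1\n"),
    parameters={"max_depth": 3, "learning_rate": 0.1},
)
record = tag_prediction(0.87, lineage)
```

Because the fingerprint is derived from the code version, the data hash and the parameters together, a change in any of them yields a different identifier on the stored predictions.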
Models are evaluated on historical data to estimate their performance. Discrepancies between historical data and a live environment can be disastrous for the real performance. Besides data discrepancies, the data flows themselves can also break or become corrupted. With proper monitoring in place, these issues can be checked continuously, and if erroneous behaviour is detected, the development team and relevant stakeholders should be notified.
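As a simple illustration, the sketch below profiles the value range of each feature in the historical data and flags live records with missing or out-of-range values. It assumes numeric features; `FeatureProfile`, `profile_feature` and `check_live_record` are hypothetical names, and real monitoring would typically cover far more than ranges.

```python
import math
from dataclasses import dataclass


@dataclass
class FeatureProfile:
    """Value range of one numeric feature as seen in historical data."""
    minimum: float
    maximum: float


def profile_feature(values) -> FeatureProfile:
    """Summarize the historical values of a single feature."""
    return FeatureProfile(min(values), max(values))


def check_live_record(record: dict, profiles: dict) -> list:
    """Return a list of human-readable issues found in one live record."""
    issues = []
    for name, profile in profiles.items():
        value = record.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            issues.append(f"{name}: missing value")
        elif not profile.minimum <= value <= profile.maximum:
            issues.append(
                f"{name}: {value} outside training range "
                f"[{profile.minimum}, {profile.maximum}]"
            )
    return issues


profiles = {"age": profile_feature([18, 35, 70])}
issues = check_live_record({"age": 120}, profiles)
```

A non-empty list of issues is the signal that could be forwarded to the development team and stakeholders through whatever alerting channel the organization already uses.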
The world is constantly changing. More data is gathered and not everything can be measured based on only historical data. Therefore, it is important to have a framework that allows data scientists to make comparisons between different options in a live environment. These different options can include an algorithm against a simple baseline (e.g. a model A/B test), multiple algorithms against each other or the same algorithm trained on more data.
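One common building block for such a framework is a stable way to split live traffic between options. The sketch below is one possible approach, not a prescribed one: it hashes an identifier so that the same user or request always lands in the same variant. The function name `assign_variant`, the salt value and the variant names are assumptions made for illustration.

```python
import hashlib


def assign_variant(unit_id: str, variants: dict, salt: str = "experiment-1") -> str:
    """Deterministically map an identifier to a variant.

    `variants` maps variant names to traffic shares that must sum to 1.
    """
    if not variants or abs(sum(variants.values()) - 1.0) > 1e-9:
        raise ValueError("variant shares must be non-empty and sum to 1")

    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)

    cumulative = 0.0
    for name, share in variants.items():
        cumulative += share
        if bucket < cumulative:
            return name
    # Guards against floating-point rounding at the upper edge.
    return list(variants)[-1]


split = {"baseline": 0.5, "candidate_model": 0.5}
variant = assign_variant("user-42", split)
```

Changing the salt per experiment reshuffles the assignment, so that the same users do not always end up in the same group across different tests.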
By the same token, because new data is collected continuously, machine learning models can become obsolete when not maintained properly. This concept is called model drift. Drift can be detected by having advanced monitoring in place. Furthermore, by automatically retraining and redeploying models at regular intervals, or when triggered by detected model drift, organizations can maximize the value of their machine learning project.
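There are many ways to detect drift. As one example, the sketch below compares the distribution of a single numeric feature in the training data with recent live data using the population stability index (PSI), and uses it as a retraining trigger. The bin count, the floor value and the 0.2 threshold are assumptions based on common rules of thumb and should be tuned per use case; the function names are hypothetical.

```python
import math


def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids taking the logarithm of zero.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(expected, actual, threshold: float = 0.2) -> bool:
    """Trigger retraining when the PSI exceeds an assumed threshold."""
    return population_stability_index(expected, actual) > threshold


training_values = [float(v) for v in range(100)]
live_values = [v + 50.0 for v in training_values]  # simulated shift
retrain = should_retrain(training_values, live_values)
```

In a scheduled job, a check like this could run per feature or on the model's output scores, with a positive result kicking off the automated retraining and redeployment pipeline.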
Nowadays, in most organizations, the lifecycle of machine learning models ends with the deployment of an initial model. For a machine learning project to be successful in the long term, it requires more attention with regard to lineage, monitoring, testing and model drift. These key components are often lacking due to missing tooling, inexperience and relatively high development costs. In further blog posts we will go in-depth into each of these topics.
Bringing a machine learning project to a successful conclusion is more difficult than it may seem up front. By taking the full machine learning lifecycle into consideration, we can understand the potential pitfalls and what is required to not only put a model in production but also keep it running. While this task may seem daunting, the tooling to help manage the machine learning lifecycle is getting more mature. Cubonacci, our machine learning lifecycle management platform, helps startups, scale-ups and enterprises with their machine learning lifecycles without compromising on flexibility. Next time we are going to deep dive into a number of the topics discussed.