What is MLOps

What MLOps is and why you need to know about it

MLOps is the discipline of addressing the combined concerns of the development, deployment and operational phases of a machine learning application. In this post we explain what MLOps is and why it is important for extracting value from machine learning.

What MLOps is - introduction

MLOps is a compound of Machine Learning and Operations. It refers to the people, culture, technology, processes and skills required to successfully apply machine learning models to business-critical processes.

Like its counterparts DevOps and DataOps, MLOps describes a form of collaboration and a set of practices. For MLOps, the collaboration is usually between data scientists, machine learning engineers, business analysts and IT operations professionals. Together, they work on the machine learning lifecycle of applications in production, from development to deployment.

The set of best practices includes release, test and quality assurance automation, continuous integration (CI), continuous deployment (CD), continuous training (CT), quick issue resolution and extensive monitoring.
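
To make the testing and CI part concrete, here is a minimal sketch of an automated quality gate, written as a pytest-style test. The dataset, model and accuracy threshold are illustrative stand-ins for your own validation data and release criteria.

```python
# test_model_quality.py: an illustrative CI quality gate.
# In a real pipeline the candidate model and validation data would
# come from your artifact store; a tiny scikit-learn model stands in
# here to keep the sketch self-contained.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # illustrative release criterion


def test_candidate_model_meets_accuracy_threshold():
    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_val, model.predict(X_val))
    # CI fails the build, and thereby blocks the release, when the
    # candidate does not meet the agreed threshold.
    assert accuracy >= ACCURACY_THRESHOLD
```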

How to build an MLOps culture

Organizing MLOps requires the following:

  • A cross-functional team capable of designing, developing, operating and maintaining a machine learning application in production
  • An automated process for versioning, validation, deployment and delivery of the machine learning application
  • The right infrastructure for training and deploying the models
  • A monitoring process and good analytics to guarantee the quality of the output of the application
  • The right tools to facilitate these different facets 

Cross-functional teams

Machine learning applications solve difficult real-world problems. They use advanced statistical modeling techniques and elaborate data transformations. Often, they are deployed within complex application landscapes. On top of that, machine learning systems can have a wide variety of software dependencies and a very dynamic infrastructure footprint.

No one can be specialized in all these domains. Therefore, a team responsible for machine learning applications must be cross-functional, having a healthy mix of specializations. Together they must be able to design, develop, operate and maintain a machine learning application in production.

While no two situations are the same, profiles often seen in a productive MLOps team are:

  • Data scientists
  • Machine learning engineers
  • Data engineers
  • Business and/or data analysts
  • Site reliability / DevOps engineers
  • Software engineers
  • Infrastructure specialists

Automated process

To be able to deploy fast and frequently, an automated deployment and testing process is necessary. Additionally, when a machine learning application causes issues in production, the team needs to fix them quickly and in a lasting way. The workflow often looks like this:

  1. Analyze the issue to find the root cause
  2. Fix the root cause in the development environment
  3. Deploy the code after it passes peer review
  4. Build the new version
  5. Validate it with automated tests
  6. Release to the next stage
  7. Repeat until production has been updated
  8. If the problem persists: GOTO 1

Only when this workflow can be executed automatically can a quick fix take this preferred, highly controlled route. Fixes land in the development repository, and are then validated, built, released and deployed.
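
What gluing these stages together can look like is sketched below; the build, test and deploy commands are placeholders rather than a recommendation for any particular CI system.

```python
# pipeline.py: an illustrative sketch of the automated workflow
# above. Each stage runs a shell command; the commands themselves
# are placeholders for your own build, test and deploy tooling.
import subprocess
import sys

STAGES = [
    ("build", ["python", "-m", "build"]),        # build the new version
    ("validate", ["pytest", "tests/"]),          # automated tests
    ("release", ["python", "deploy.py", "--stage", "staging"]),  # hypothetical deploy script
]


def run_pipeline() -> bool:
    for name, command in STAGES:
        print(f"running stage: {name}")
        if subprocess.run(command).returncode != 0:
            # A failing stage aborts the pipeline, so nothing broken
            # gets promoted toward production.
            print(f"stage {name} failed, aborting")
            return False
    return True


if __name__ == "__main__":
    sys.exit(0 if run_pipeline() else 1)
```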

To achieve automated workflows like this, a healthy degree of standardization of the workflow steps is needed first.

The Right Infrastructure

The core elements that a production-ready machine learning platform needs to train and deploy models are infrastructure, software and a container orchestration service.

Infrastructure

The infrastructure footprint of the machine learning workflow is very dynamic. Sometimes you need a massive amount of compute for batch processes; at other times a model requires a GPU machine and is optimized for minimal latency. The claims on infrastructure need to be managed as dynamically as these requirements come and go. The cloud was designed exactly for this purpose, and many modern on-premise data centers offer similarly dynamic allocation of infrastructure.

Software

One model runs in Python with a TensorFlow 2 backend. Another requires an earlier, incompatible version. Maybe you are running a proof of concept with the inference of a model written in Go, which needs an entirely different set of dependencies. The way to solve this dependency problem is to isolate every environment in its own container. Although not exactly the same, containers can be described as lightweight virtual machines.
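
As a sketch of this isolation, the Docker SDK for Python can run two models with conflicting TensorFlow versions side by side, each in its own container. The model scripts are hypothetical; the image tags are real TensorFlow releases.

```python
# Illustrative use of the Docker SDK for Python (pip install docker).
# Two models with incompatible TensorFlow versions run in separate
# containers, so their dependencies never clash. The model scripts
# under /models are hypothetical.
import docker

client = docker.from_env()

for image, command in [
    ("tensorflow/tensorflow:2.11.0", "python /models/model_a.py"),
    ("tensorflow/tensorflow:1.15.5", "python /models/model_b.py"),
]:
    # Each run gets a clean, isolated environment from its image.
    logs = client.containers.run(image, command, remove=True)
    print(logs.decode())
```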

Container Orchestration

Containers are stored in a container registry and need an execution runtime like Docker. When there are many interdependent containers, they need to be managed. This is where Kubernetes comes in. Kubernetes is an open-source container orchestration platform, originally developed by Google and now maintained by the Cloud Native Computing Foundation. It handles concerns like networking, security, and dynamically spinning containers up and down. It’s an engineering feat and a complex piece of software.
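
As a small illustration, the official Kubernetes Python client lets you scale a model-serving deployment up and down programmatically; the deployment name and namespace below are placeholders.

```python
# Illustrative sketch with the official Kubernetes Python client
# (pip install kubernetes). The deployment name and namespace are
# placeholders for your own cluster resources.
from kubernetes import client, config

config.load_kube_config()  # reads your local kubeconfig
apps = client.AppsV1Api()


def scale_model_deployment(name: str, namespace: str, replicas: int) -> None:
    # Patch only the desired replica count; Kubernetes spins
    # containers up or down to match it.
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


# Scale the (hypothetical) model server up before a traffic peak.
scale_model_deployment("model-server", "ml-serving", replicas=5)
```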

Monitoring and Analytics

The performance of a model can usually be estimated after training, based on historical data. However, numerous things can go wrong once the model is running in production. The underlying process can change, the data definition can change, or there can be an error in one of the data pipelines.

How can we know the application is still working as intended? By using monitoring to answer the following questions:

  • Do the distributions of the input features remain similar enough to the distributions during the training phase?
  • Is the distribution of predicted labels similar to the distribution of the labels in the training set?
  • Is the output of the model within the bounds of plausible outputs?
  • How do your performance metrics on the actual outcomes compare to the metrics when the model ran on the test set?
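
The first of these questions, for example, can be answered with a two-sample statistical test. Below is a minimal sketch using the Kolmogorov-Smirnov test from scipy on synthetic data; the alerting threshold is an assumption you would tune to your own tolerance for false alarms.

```python
# Illustrative input-drift check: compare the live distribution of a
# feature against its training distribution with a two-sample
# Kolmogorov-Smirnov test. The data is synthetic and the threshold
# is a placeholder.
import numpy as np
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # illustrative alerting threshold


def feature_has_drifted(train_values, live_values) -> bool:
    statistic, p_value = ks_2samp(train_values, live_values)
    # A small p-value means the samples are unlikely to come from
    # the same distribution, i.e. the input feature has drifted.
    return p_value < P_VALUE_THRESHOLD


rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training distribution
live = rng.normal(loc=0.5, scale=1.0, size=1_000)    # shifted live data
print(feature_has_drifted(train, live))  # True: drift detected
```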

The Right Tools

Meeting the requirements listed above requires advanced tooling. Employing the right tools does not automatically make a team follow the MLOps philosophy, but without the right tooling these best practices are impossible to achieve.

An MLOps platform should manage all of these requirements. A number of emerging tools cover this space, and most cloud vendors are adding services aimed at operating machine learning models in production.

Cubonacci, the MLOps platform I co-founded, originated from our frustration with repeatedly re-engineering solutions for tasks that data science teams should not have to handle themselves. We did not find a solution that allowed for high developer flexibility and fit naturally into the data science workflow, so we decided to build one ourselves.

We aim to deliver a user-friendly, integrated solution that offers very high model flexibility. Every real-world problem that can be better solved with a machine learning application needs a unique approach. Therefore we designed the platform to be code-first. 

Git is the interface to Cubonacci. To work with your code repository, we need a healthy degree of standardization in how it is structured. This enables friendly user interfaces and end-to-end workflow automation, while gently enforcing best practices. Having a git commit at the start of the workflow makes it possible to trace every experiment, model, deployment and even every call of a deployed API back to the version of the code it is based on. Not having to manage that is a huge relief for users.
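
As a general illustration of this idea, outside of any specific platform, one could stamp every model artifact with the commit hash it was built from; the file name and metric below are hypothetical.

```python
# Illustrative traceability sketch (not Cubonacci's internal API):
# write the git commit a model was trained from into its metadata,
# so every artifact can be traced back to the exact code version.
import json
import subprocess


def current_commit() -> str:
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


def write_model_metadata(path: str, metrics: dict) -> None:
    metadata = {"git_commit": current_commit(), "metrics": metrics}
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)


# Hypothetical usage after a training run:
write_model_metadata("model_metadata.json", {"accuracy": 0.93})
```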

If you would like to try how this setup works for you, you can sign up for the beta of the free tier of cloud-hosted Cubonacci. The Getting Started guide, which links to the beta, can be found here: docs.cubonacci.com/getting-started.