Paving the MLOps Roadmap

Contributed by Terry Cox, Chair, MLOps SIG

“I came because I heard the streets were paved with gold. When I got here, I found out three things: first, the streets weren’t paved with gold; second, they weren’t paved at all; and third, I was expected to pave them.”

This month sees the publication of the 2022 edition of the CDF MLOps Roadmap. This may be the third edition of the Roadmap, but the issues that brought it into existence remain a present and ever more pressing concern.

Machine Learning (ML) as a field of study has roots that go back to the 1940s and the early development of mathematics to attempt to model neural networks. However, as an engineering discipline, it has suffered from repeated setbacks, with peaks of expectation dashed by long troughs of practical failure.

This means that, whilst the science has a long history, application in the real world is still very much in its infancy.

Recent studies show that around 80% of machine learning projects fail to make it into production.

This problem has two fundamental issues at its core:

Running ML in production has all the problems of managing conventional software assets plus a large number of complex new problems caused by the large volumes of data that must be managed as if they were source assets in development instead of as purely operational information at runtime.
ML practitioners generally come from a mathematical educational background and have minimal exposure to the best known methods of delivering and managing software systems.

A ticking time-bomb…

This has brought us to a worrying moment in time. On one hand, we have DevOps and Continuous Delivery, with lots of tried and tested tools and processes for safely and rapidly delivering conventional software assets into production. On the other hand, we have a selection of “MLOps” tools which have been built by Data Scientists to make it easy to take hand-crafted models and throw them over the fence into live production environments in a manner that harkens back to the bad old days of software delivery, treating models as if they were operational data rather than core assets in a software product lifecycle.

Diving deeper, we find a strange situation. In practice, the ML component of a modern “smart” product represents about 5% of the overall effort required to take that product to market. 95% of the product gets delivered using Continuous Delivery techniques with automated governance and testing processes. The last 5%, which contains the vast majority of the risk, is then put live by manually uploading models from laptops onto active production servers, with no governance to speak of.

Clearly, this approach is inherently far too risk-laden to continue for much longer without a major public incident occurring. Our concern is that, in an environment in which the public have already been sensitised to the perceived threat of “AI stealing their jobs”, there would be widespread backlash against a large public failure, leading to punitive regulation and yet another “AI Winter.”

As a result, we created the MLOps Roadmap, with the hope that we could help to realign efforts across the industry to provide better working practices and tools for managing ML in our products.

A better definition of MLOps

We define MLOps as “the extension of the DevOps methodology to include Machine Learning and Data Science assets as first class citizens within the DevOps ecology”. Starting from this basis, the Roadmap sets out to describe all the cross-cutting concerns that are shared with Continuous Delivery of conventional assets, and then details all the new challenges that are introduced into this space by the ML process and assets.

We highlight several key concerns this year:

There is still a pressing need to educate Data Scientists on the fundamentals of software delivery, and to educate CI/CD system vendors on the challenges of managing ML assets in production.
We must accelerate work to extend existing CI/CD tools to address the challenges associated with ML assets.
We must develop new techniques to allow us to version massive datasets and effectively move very large datasets through training infrastructure at manageable cost.
We must implement formal, automated governance processes for release management of highly sensitive, high risk ML assets.
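
To make the last two concerns concrete, the following is a minimal sketch in Python of what content-based dataset versioning and an automated release gate might look like. It is an illustration under stated assumptions, not a method taken from the Roadmap: the names `dataset_version`, `ReleaseCandidate` and `release_gate`, the choice of accuracy as the metric, and the 0.90 threshold are all hypothetical, and a real system would hash files or chunks rather than in-memory records.

```python
import hashlib
from dataclasses import dataclass


def dataset_version(records):
    """Derive a deterministic version ID from dataset content (hypothetical helper)."""
    digest = hashlib.sha256()
    # Sort so that the same records in a different order give the same version.
    for record in sorted(records):
        data = record.encode("utf-8")
        # Length-prefix each record so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


@dataclass(frozen=True)
class ReleaseCandidate:
    """Metadata a pipeline might record for a model awaiting release (assumed fields)."""
    model_name: str
    data_version: str   # output of dataset_version() for the training data
    accuracy: float     # metric produced by an automated evaluation run
    approved_by: str    # empty string means nobody has signed off


def release_gate(candidate, expected_data_version, min_accuracy=0.90):
    """Return a list of governance failures; an empty list means the release may proceed."""
    failures = []
    if candidate.data_version != expected_data_version:
        failures.append("training data does not match the recorded version")
    if candidate.accuracy < min_accuracy:
        failures.append("accuracy below release threshold")
    if not candidate.approved_by:
        failures.append("no recorded approver")
    return failures


if __name__ == "__main__":
    version = dataset_version(["user_1,0.3", "user_2,0.7"])
    candidate = ReleaseCandidate("churn-model", version, 0.93, "release-board")
    problems = release_gate(candidate, version)
    print("release allowed" if not problems else problems)
```

The point of the sketch is that the model is released only when its training data, its evaluation result and its approval are all recorded and checked by the pipeline, rather than being uploaded by hand.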

“Those that fail to learn from history are doomed to repeat it.”

Pragmatically speaking, we are running out of time. In several jurisdictions, dedicated AI legislation is expected to be put in place in the next year. Existing “MLOps” processes will not be able to comply with this proposed legislation in any meaningful way. It is thus extremely urgent that we address this problem with effective tooling before regulators begin to intervene in this space.

The MLOps Roadmap can be found here.

Information about how to get involved can be found at https://github.com/cdfoundation/sig-mlops

Terry Cox
Chair, MLOps SIG
terry@bootstrap.ltd