One of the biggest challenges operations groups will face over the coming year will be learning how to support AI- and ML-based applications. On one hand, ops groups are in a good position to do this; they’re already heavily invested in testing, monitoring, version control, reproducibility, and automation. On the other hand, they will have to learn a lot about how AI applications work and what’s needed to support them. There’s a lot more to AI Operations than Kubernetes and Docker. The operations community has the right language, and that’s a great start; I do not mean that in a dismissive sense. But on inspection, AI stretches the meanings of those terms in important but unfamiliar directions.
Three things need to be understood about AI.
First, the behavior of an AI application depends on a model, which is built from source code and training data. A model isn’t source code, and it isn’t data; it’s an artifact built from the two. Source code is less important than it is in typical applications; the training data is what determines how the model behaves, and the training process is all about tweaking the model’s parameters so that it delivers correct results most of the time.
This means that, to have a history of how an application was developed, you have to look at more than the source code. You need a repository for models and for the training data. There are many tools for managing source code, from git back to the venerable SCCS, but we’re only starting to build tools for data versioning. And that’s essential: if you need to understand how your model behaves, and you don’t have the training data, you’re sunk. The same is true for the models themselves; if you don’t have the artifacts you produced, you won’t be able to make statements about how they performed. Given the source code and the training data, you could reproduce a model, but the result almost certainly wouldn’t be identical to the original because of randomization in the training process.
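To make the idea of a history that covers more than source code concrete, here is a minimal sketch in Python. It is an illustration, not a recommendation of any particular tool: the function name `build_manifest`, the manifest fields, and the sample values are all hypothetical. It records content hashes of the training data and the model artifact next to the code revision and the random seed, so that a deployed model can later be traced back to exactly what produced it.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a blob (data file or model artifact)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(code_revision: str, training_data: bytes,
                   model_artifact: bytes, seed: int) -> dict:
    """Tie a model artifact to the code, data, and seed that produced it.

    All field names here are assumptions made for this sketch.
    """
    return {
        "code_revision": code_revision,
        "training_data_sha256": sha256_bytes(training_data),
        "model_sha256": sha256_bytes(model_artifact),
        "random_seed": seed,
    }


# Placeholder inputs; in practice these would be read from real files.
manifest = build_manifest("abc123", b"x,y\n1,2\n", b"fake-model-bytes", seed=42)
print(json.dumps(manifest, indent=2))
```

Storing the seed helps, but it does not by itself guarantee a bit-identical model on retraining, which is one more reason to keep the artifact itself.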
Second, the behavior of AI systems changes over time. Unlike a web application, they aren’t strictly dependent on the source. Models almost certainly react to incoming data; that’s their point. They may be retrained automatically. They almost certainly grow stale over time: users change the way they behave (often, the model is responsible for that change), and the model grows outdated.
This changes what we mean by “monitoring.” AI applications need to be monitored for staleness—whatever that might mean for your particular application. They also need to be monitored for fairness and bias, which can certainly creep in after deployment. And these results are inherently statistical. You need to collect a large number of data points to tell that a model has grown stale. It’s not like pinging a server to see if it’s down; it’s more like analyzing long-term trends in response time. We have the tools for that analysis; we just need to learn how to re-deploy them around issues like fairness.
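As a rough illustration of why staleness is a statistical question, the sketch below tracks a rolling window of prediction outcomes and declines to give a verdict until the window is full. The class name `StalenessMonitor`, the window size, and the tolerance are assumptions made for this example; a real deployment would choose its own definition of staleness and would likely use a proper statistical test.

```python
from collections import deque


class StalenessMonitor:
    """Compare recent accuracy against the accuracy measured at deployment."""

    def __init__(self, baseline_accuracy: float, window: int = 500,
                 tolerance: float = 0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction_was_correct: bool) -> None:
        self.outcomes.append(1 if prediction_was_correct else 0)

    def is_stale(self):
        """Return None until the window is full, then True or False."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return None  # too few data points to say anything
        recent = sum(self.outcomes) / len(self.outcomes)
        return recent < self.baseline - self.tolerance
```

The same pattern could track a fairness metric per user group instead of overall accuracy; only the quantity being recorded changes.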
We should also ask what “observability” means in a context where even “explainability” is always an issue. Is it important to observe what happens on each layer of a neural network? I don’t know, but that’s a question that needs answering. Charity Majors’ emphasis on cardinality and inferring the internal states of a system from its outputs is the right direction in which to be looking, but in AI systems, the number of internal states grows by many orders of magnitude.
Last, and maybe most important: AI applications are, above all, probabilistic. Given the same inputs, they don’t necessarily return the same results each time. This has important implications for testing. We can do unit testing, integration testing, and acceptance testing—but we have to acknowledge that AI is not a world in which testing whether 2 == 1+1 counts for much. And conversely, if you need software with that kind of accuracy (for example, a billing application), you shouldn’t be using AI. In the last two decades, a tremendous amount of work has been done on testing and building test suites. Now, it looks like that’s just a start. How do we test software whose behavior is fundamentally probabilistic? We will need to learn.
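One possible shape for such a test, offered as a sketch rather than an answer: instead of asserting that a single call returns an exact value, run many trials and assert that the aggregate success rate clears a threshold. Everything here is hypothetical, including `noisy_classifier`, which stands in for a model that is right about 90% of the time, and the 0.85 threshold.

```python
import random


def noisy_classifier(x: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for a model: correct roughly 90% of the time."""
    truth = x > 0
    return truth if rng.random() < 0.9 else not truth


def success_rate(trials: int, rng: random.Random) -> float:
    """Fraction of trials in which the classifier matches the true label."""
    correct = 0
    for _ in range(trials):
        x = rng.uniform(-1, 1)
        if noisy_classifier(x, rng) == (x > 0):
            correct += 1
    return correct / trials


def test_accuracy_is_acceptable():
    # A fixed seed keeps the test itself repeatable; the threshold sits
    # well below the expected rate so ordinary noise does not cause failures.
    rate = success_rate(2000, random.Random(0))
    assert rate >= 0.85


if __name__ == "__main__":
    test_accuracy_is_acceptable()
```

The design choice worth noticing is that the assertion is about a distribution, not a value: the number of trials and the gap between the expected rate and the threshold determine how often the test fails by chance.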
Those are the basics. There are other issues lurking. Collaboration between AI developers and operations teams will lead to growing pains on both sides, especially since many data scientists and AI researchers have had limited exposure to, or knowledge of, software engineering. The creation and management of data pipelines isn’t something that operations groups are responsible for today. Despite the proliferation of new titles like “data engineer” and “data ops,” though, I suspect these jobs will eventually be subsumed into “operations.”
It’s going to be an interesting few years as operations assimilates AI. The operations community is asking the right questions; we’ll learn the right answers.