AI Model Lifecycle Governance
A primer on practical and real approaches to managing AI models through versioning, validation, rollout and monitoring.
In a couple of previous articles I wrote about practical approaches to data governance and promised to return to model governance later. So here goes - I hope you enjoy this primer on some of the proven practices for AI model governance that I’ve found most effective and pragmatic.
To start exploring good model lifecycle governance, let’s begin with a lesser-known AI incident from 2020, one that might have been relatively innocuous if not for what it revealed about the model governance in place and how it needed to change: Twitter’s image cropping algorithm.
When users uploaded multiple images or tall images, Twitter would automatically crop them for preview in the timeline. At first glance, it seemed like a relatively harmless feature. The goal was to pick the most “salient” part of the image—something visually interesting or likely to catch attention—so users could quickly scan through their feed. Behind the scenes, a machine learning model was making that crop decision.
But users started noticing something odd. When they uploaded side-by-side portraits of people with different skin tones or genders, Twitter’s algorithm seemed to consistently favour white faces over black ones, and women over men, when deciding which part of the image to preview. This wasn’t just anecdotal. People ran informal experiments to test it, and the results were remarkably consistent. No matter how the images were arranged, the crop focused on lighter-skinned faces.
What made this story so revealing wasn’t just the bias itself — it was how Twitter had deployed the model without fully understanding how it behaved across different use cases and demographic groups, and then, at least at first, struggled to respond. There was no sign that any bias testing had been performed before launch, and no documentation of what the model had been trained on or how it was evaluated. When users flagged the issue publicly, the company initially struggled for some time to even explain what was happening or why.
Twitter eventually acknowledged the problem and to their credit responded very admirably1. They ran their own internal analysis, confirmed the bias, and published a detailed technical post-mortem that is well worth a read2. They ultimately decided to retire the cropping algorithm entirely. From that point forward, Twitter shifted to showing full uncropped images by default—essentially removing the model from the loop. Although it took time, it was a rare case of an organisation taking full responsibility, being transparent, and choosing to simplify the product rather than maintain a flawed AI system.
But the lesson here runs deeper, for us, and I’m sure for Twitter (although perhaps it’s forgotten in the transition to X). This wasn’t a model that failed because of poor accuracy; it failed due to a lack of model lifecycle governance. There was no structured fairness evaluation before launch, no clear escalation path when issues were raised, no audit trail linking the model’s behaviour to design choices or data provenance. And crucially, there was no apparent plan for handling user feedback or retraining once the model was in the wild. What looked like a small utility feature became a very public reminder that when you don’t test for inclusion, exclusion becomes the default.
This is why model governance matters. Not just for high-stakes systems like credit scoring or healthcare, but even for the small, subtle AI systems that shape how we experience the world online. When you don’t govern the lifecycle, when you skip validation, ignore documentation, and leave no way to monitor behaviour once deployed, you hand over control to the system, without knowing who it might quietly leave out.
Just as software engineering has practices for code (version control, code reviews, continuous integration), machine learning has analogous practices for models, although they are still less mature and far less broadly implemented. You’ll hear this described as MLOps or ML lifecycle management. Model governance is about instilling discipline in how we build, evaluate, deploy, and maintain models. Among other things, it’s about ensuring that there’s a single source of truth for “which model is running where,” that every model has an owner, that changes are tracked, and that the model’s behaviour stays within expected bounds over time.
I want to start with an essential primer on the four key components of model governance:
Versioning and lineage
Validation and testing
Controlled deployment and rollback
Monitoring, drift detection and feedback
Versioning and Lineage for Models
One of the most essential practices in model governance is versioning. It sounds basic, but in reality, I’ve seen teams end up with a folder full of files like model_final_best.h5 or model_final_V2.pkl, and no one quite remembers which one actually got deployed. That’s exactly the kind of ambiguity you want to avoid.
Every trained model should have a unique version identifier, along with metadata that explains how it was created. In software, we use tools like Git to track changes. With machine learning models, the idea is similar. Whether you’re using Git-LFS, DVC, MLflow, a commercial tool, or just a solid naming convention with a tracking log, what matters is being able to trace each model back to the exact code, data, and parameters it was built from.
A good versioning approach connects three elements: the training code version, the dataset version, and the model weights file. That could be a Git commit for the code, a dataset ID or hash for the data, and a model artifact stored in your registry. Many teams use experiment tracking tools like MLflow to capture all of this automatically. Even something as simple as a spreadsheet can work, as long as it’s consistent and accurate. But wherever possible, I recommend automating it. Some teams integrate version logging directly into the training pipeline so that every completed run registers itself in the model registry with a new version number and all the relevant metadata already attached.
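To make that concrete, here’s a minimal sketch of what automated version logging might look like, assuming MLflow for tracking and a scikit-learn model; the registered model name, parameters, and toy dataset are purely illustrative.

```python
# Minimal sketch: register a model version with its lineage metadata attached.
# Assumes an MLflow tracking server is configured; names are illustrative.
import hashlib
import subprocess

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a toy model so the example is self-contained.
X, y = make_classification(n_samples=500, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The three lineage elements: code version, data version, model artifact.
git_commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
dataset_hash = hashlib.sha256(X.tobytes()).hexdigest()

with mlflow.start_run():
    mlflow.set_tag("git_commit", git_commit)
    mlflow.set_tag("dataset_sha256", dataset_hash)
    mlflow.log_param("n_estimators", 200)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-classifier",  # each run becomes a new registry version
    )
```

The point isn’t the specific tool; it’s that every completed training run leaves behind a registry entry linking code, data, and artifact without anyone having to remember to write it down.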
Lineage is just as important. If you’re starting from a pre-trained model, especially one from a public repository or a commercial provider, that base model becomes part of your model’s story. You should document exactly where it came from, which version you used, and what fine-tuning or adaptation was applied. Similarly, if one model is simply an updated version of another, trained on more recent data or with a different configuration, that relationship should be tracked. If a flaw is found in the original, you’ll want to know exactly which downstream models might be affected.
It’s also worth remembering that models are rarely just standalone files. They often rely on configuration artifacts: things like preprocessing steps, feature encoders, vocabularies, or parameters. For example, a text model might use a specific vocabulary-to-index mapping. If that mapping doesn’t match at inference time, your predictions could be meaningless. That’s why I always recommend storing these artifacts alongside the model or clearly linking them through your versioning system.
The goal here is traceability. You should always be able to answer a simple question: which model is currently running in production, and how was it built? And if someone asks about a decision made by the system, whether it was yesterday or six months ago, you have to be able to trace it back to the exact model version, the data it was trained on, and the code and configuration used.
In regulated industries like finance, this kind of traceability is already expected. Every model must be documented, reviewed, and approved before it goes live. But even in less regulated environments, it’s still good engineering. Models should be versioned and managed with the same discipline you apply to code. And that discipline starts with versioning and lineage that gives you confidence in every decision your model makes.
Validation and Testing Before Deployment
Before any model goes live, it needs proper validation. Not just a quick check of the accuracy score, but real validation. You’re not just asking whether the model performs well in testing, but whether it’s fit for purpose in the real world, safe under pressure, and fair in its decisions. I’m hoping to write a whole series of articles that go much deeper into model auditing and evaluation, but for now let’s just go through how it might look at a high level in practice.
Performance Evaluation. First, check your core metrics—accuracy, F1 score, RMSE, whatever’s appropriate for the task. This is the first gate because there’s no point safety testing a model that just doesn’t work. And make sure you’re evaluating the model on a hold-out test set that wasn’t used during training or hyperparameter tuning. If you can, test it on multiple datasets that reflect different user groups or edge cases.
For example, if you're working on a face recognition model, evaluate performance across different ethnicities and genders. If you see a drop in accuracy for a particular group, that might point to a bias in the training data, or a deeper issue with model generalisation. Ideally you’ll have set acceptance criteria ahead of time: maybe your model needs to hit a certain precision and recall overall, and no group can have more than a 5% drop-off compared to others. If those benchmarks aren’t met, the model doesn’t ship.
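As a sketch of what such an acceptance gate might look like in code (the thresholds, metrics, and column names here are illustrative assumptions, not a standard):

```python
# Minimal sketch of a pre-deployment acceptance gate: overall precision/recall
# plus a per-group drop-off check. Thresholds and column names are illustrative.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def passes_acceptance_gate(results: pd.DataFrame,
                           min_precision: float = 0.90,
                           min_recall: float = 0.85,
                           max_group_gap: float = 0.05) -> bool:
    """results has columns: y_true, y_pred, group."""
    if precision_score(results.y_true, results.y_pred) < min_precision:
        return False
    if recall_score(results.y_true, results.y_pred) < min_recall:
        return False

    # Accuracy per subgroup; fail if the worst group trails the best group by
    # more than the agreed drop-off.
    group_accuracy = results.groupby("group").apply(
        lambda g: accuracy_score(g.y_true, g.y_pred)
    )
    return (group_accuracy.max() - group_accuracy.min()) <= max_group_gap
```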
Robustness Testing. Next, test how your model handles messiness. The real world doesn’t send clean inputs. So throw a few edge cases at the model: add noise to your data, distort images, feed it misspellings or slang, tweak inputs just slightly and see what happens. You’re checking for brittleness. If the model changes its predictions wildly with a small change in input, that’s a red flag. For some systems, especially generative models or anything that works with user-generated content, it’s also worth doing adversarial testing. Try to break it on purpose. Find the weird edge cases so you stay ahead of angry users or viral failures. If the model misbehaves, figure out if you need to retrain, augment your data, or put guardrails in place.
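For tabular inputs, one simple way to quantify that brittleness is to check how often predictions flip under tiny perturbations; something like the sketch below, where the noise scale and the 98% stability bar are illustrative assumptions.

```python
# Minimal sketch of a robustness probe: perturb inputs slightly and measure how
# often the model changes its prediction. Noise level and threshold are illustrative.
import numpy as np

def prediction_stability(model, X: np.ndarray,
                         noise_scale: float = 0.01, n_trials: int = 20) -> float:
    """Fraction of rows whose prediction never flips under small Gaussian noise."""
    rng = np.random.default_rng(42)
    baseline = model.predict(X)
    unchanged = np.ones(len(X), dtype=bool)
    for _ in range(n_trials):
        perturbed = X + rng.normal(0.0, noise_scale, size=X.shape)
        unchanged &= (model.predict(perturbed) == baseline)
    return unchanged.mean()

# Example gate: treat the model as brittle if more than 2% of predictions flip.
# assert prediction_stability(model, X_test) >= 0.98
```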
Fairness and Bias Audit. Even if your dataset passed its bias checks, it doesn’t guarantee the model is fair. Sometimes, models amplify the patterns they learn. Maybe your training data showed that 10% of Group A had negative outcomes and 5% of Group B did, but now your model is predicting 15% versus 3%. That kind of amplification happens more often than you might think3. You want to check your predictions across different subgroups using fairness metrics: disparate impact, false positive and false negative rates by group, calibration differences, and so on. If you have ground truth data, great, but if not, you might rely on proxies or expert judgement.
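A lightweight version of that subgroup check might look like the sketch below; the column names and the 0.8 disparate impact rule of thumb are assumptions, and real audits usually go further (calibration, confidence intervals, intersectional groups).

```python
# Minimal sketch of a per-group fairness report: selection rate, false positive
# rate and false negative rate by group, plus a disparate impact ratio.
import pandas as pd

def fairness_report(results: pd.DataFrame) -> pd.DataFrame:
    """results has columns: y_true (0/1), y_pred (0/1), group."""
    rows = []
    for group, g in results.groupby("group"):
        negatives = (g.y_true == 0).sum()
        positives = (g.y_true == 1).sum()
        rows.append({
            "group": group,
            "selection_rate": (g.y_pred == 1).mean(),
            "fpr": ((g.y_pred == 1) & (g.y_true == 0)).sum() / max(negatives, 1),
            "fnr": ((g.y_pred == 0) & (g.y_true == 1)).sum() / max(positives, 1),
        })
    report = pd.DataFrame(rows)
    # Disparate impact: lowest selection rate over highest (a common rule of
    # thumb flags anything below 0.8 for review).
    report.attrs["disparate_impact"] = (
        report.selection_rate.min() / report.selection_rate.max()
    )
    return report
```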
If you find a problem, don’t sweep it under the rug or just hope it will go away in later training. Create a mitigation plan. That could mean retraining with bias-aware techniques, rebalancing your inputs, or post-processing the outputs. For high-impact systems, external reviewers can help at this point (provided they actually know what they’re doing - easier said than done). A fairness, ethics or other AI governance panel can help assess whether the model is acceptable for deployment. Better to catch these issues early than to explain them in a press release.
Compliance Check. Now step back and ask: does this model meet all legal and internal requirements? If the system needs to be explainable, do you have a way to show how it makes decisions? If it touches personal data, have you reviewed it with legal or privacy teams? If the model introduces a new type of data processing, does your privacy policy need updating?
If you’ve done light assessments and checklists during the development process, you’re hopefully in a good place by now. If not, you may have to pause deployment completely while the legal, risk and audit process runs its course. It’s incredibly frustrating to have a high-performance model that passes its safety tests but can’t be deployed because of a compliance audit, so it’s best to start this as early as possible, ideally during the design phase.
The Model Card. Then before deployment, wrap up your documentation, specifically the Model Card for this version. This is your short, structured summary of everything that matters: what the model does, how it was built, what data it used, who trained it and when, what the key performance and fairness results are, and where the model has known limitations.
Model Cards are becoming a best practice in safe, responsible AI. They don’t need to be long, but they do need to be honest. If your object detection model struggles in low-light environments, say so. If the model isn’t calibrated for users outside your core markets, include that. These cards help internal teams make informed decisions, and if needed, they give you a way to explain the model to users or regulators later on. Do everyone a favour and identify actual risks in your model cards and don’t just copy from another4 (as happens in many cases).
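One practical trick is to keep the Model Card as structured data in the registry alongside the model version, so it can’t drift away from the artifact it describes. A minimal, purely illustrative sketch:

```python
# Minimal sketch of a Model Card stored as structured data next to the model
# version. Every field and value here is illustrative.
model_card = {
    "model": "image-crop-saliency",
    "version": "2025.1",
    "owner": "vision-team@example.com",
    "intended_use": "Select preview crops for timeline images.",
    "training_data": "Internal saliency dataset, snapshot 2024-11 (hash abc123).",
    "evaluation": {
        "overall_accuracy": 0.91,
        "max_subgroup_gap": 0.04,   # worst-case drop-off vs. the best group
    },
    "known_limitations": [
        "Struggles in low-light images.",
        "Not calibrated for users outside core markets.",
    ],
    "approved_by": [],  # filled in at the sign-off step below
}
```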
Sign-off and Go/No-Go Decision. Finally, no model should go live without someone signing off on it. Whether that’s a formal governance committee or just the product owner and tech lead, someone has to take responsibility. In high-integrity environments, I like to keep a written record: “Model version 2025.1 approved for deployment by Jane Doe (Head of Data Science) and John Smith (Business Owner) on 2025-03-01, following review of validation results.”
That one sentence does a lot of work. It shows that someone actually reviewed the model. It creates ownership. And it slows down anyone trying to rush a half-tested model into production. In regulated sectors like finance, this step isn’t optional. But even in a fast-moving startup, a lightweight version of this can save you from major headaches later.
This all comes down to approaching validation as a high-integrity process, not a checkbox. It’s the moment where you decide whether the model is genuinely ready. Get this part right, and you’ll catch most of the serious issues before they reach your users. Get it wrong, and you might not realise there’s a problem until someone flags it on X, or in court.
Controlled Deployment and Rollback
Once a model has been validated and signed off, it’s tempting to think the job is done. But deployment is its own phase of governance, and it needs just as much care. This is the moment the model meets the real world, and that transition should never happen all at once. The principle here is simple: go gradual, and make it observable.
A robust deployment doesn’t flip a switch. It starts small. A good approach is a canary release, where you send just a small portion of traffic, maybe 2%, to the new model while the rest continues using the current version. That gives you a safe way to compare outcomes side by side and spot any unexpected behaviours. An even better approach (although it doesn’t work for all circumstances) is to run a shadow deployment. This is where the new model doesn’t affect anything yet, but it runs in parallel and logs what it would have done. You can then compare those logs against actual decisions from the live model. Either way, the idea is to give yourself a chance to catch issues with real data, before they impact real users.
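In code, the routing logic itself can be surprisingly small; here’s a hedged sketch assuming two in-process model handles and standard logging (real serving stacks usually do this at the load balancer or feature-flag layer instead).

```python
# Minimal sketch of a canary router with optional shadow logging. The 2% split
# and model interfaces are illustrative, not a production serving framework.
import logging
import random

logger = logging.getLogger("model_rollout")

def predict(request, current_model, candidate_model,
            canary_fraction: float = 0.02, shadow: bool = True):
    if random.random() < canary_fraction:
        # Canary path: the candidate serves a small slice of live traffic.
        result = candidate_model.predict(request)
        logger.info("canary prediction: %s", result)
        return result

    result = current_model.predict(request)
    if shadow:
        # Shadow path: the candidate runs in parallel but its output is only
        # logged for later comparison, never returned to the user.
        shadow_result = candidate_model.predict(request)
        logger.info("shadow compare: live=%s candidate=%s match=%s",
                    result, shadow_result, result == shadow_result)
    return result
```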
But this only works if you’re actually watching the right things. Part of model governance is specifying what metrics matter during rollout. That might be accuracy, conversion rate, latency, error volume, or even indirect signals, like a spike in customer support queries or commentary on social media. If any of those metrics degrade or drift outside of expected bounds, deployment should pause. Better yet, you define clear thresholds in advance. For example: “If the new model’s key metric drops more than 10% for over 30 minutes, get ready to roll back immediately.”
Rollback Planning. And that brings us to the next piece: rollback planning. You want to always have a way to quickly revert to a stable version if the new model causes trouble. That might mean keeping the old model loaded and ready, or having a script in place that can redeploy it in seconds. The important thing is not to wait until things go wrong to figure out what your fallback is. Time matters when you're dealing with a live system. Teams can waste hours debating whether to roll back while metrics are tanking, when it would have been easier to make the decision ahead of time about what unacceptable looks like.
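Those two ideas, pre-agreed thresholds and a ready-to-go fallback, can be wired together. Here’s a sketch, with the 10% drop and 30-minute window taken from the example above and the redeploy call left as a hypothetical stub.

```python
# Minimal sketch of a threshold-triggered rollback decision. The metric window,
# the 10% drop and the redeploy call are illustrative assumptions.
from collections import deque
from statistics import mean

class RollbackGuard:
    def __init__(self, baseline: float, max_drop: float = 0.10, window: int = 30):
        self.baseline = baseline            # key metric under the current model
        self.max_drop = max_drop
        self.recent = deque(maxlen=window)  # e.g. one reading per minute

    def record(self, metric_value: float) -> bool:
        """Record a new reading; return True if rollback should be triggered."""
        self.recent.append(metric_value)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data for the full window yet
        return mean(self.recent) < self.baseline * (1 - self.max_drop)

# guard = RollbackGuard(baseline=0.92)
# if guard.record(latest_accuracy):
#     redeploy_previous_model_version()   # hypothetical, pre-tested rollback script
```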
In anything but very small deployments, a bit more formalism is needed. You could use a checklist: Have all tests passed? Is monitoring active? Has rollback been tested? These checklists are helpful, not just for accountability, but to slow the team down just enough to make sure nothing important has been missed. For high-impact models, use change management forms that record the purpose of the update, the validation outcomes, the risk assessment, and the plan for launch. It doesn’t have to be heavy-handed, but there should be a record of what you’re changing and why.
Technical configuration matters too. The environment a model is deployed into should match the one it was tested in. Models that worked beautifully in development can fall apart in production because of different library versions, mismatched dependencies, or a container that didn’t quite match the test build. Sometimes it’s even simpler: the model’s just too big for the memory allocated, and everything slows to a crawl. Good governance here is really just good engineering. Ideally, you’ve containerised the model and bundled all its dependencies, so that what you tested is exactly what gets deployed. And you’ve thought through scalability. If your model relies on GPUs, can your infrastructure handle demand spikes? Does it autoscale properly? These aren’t always seen as governance issues, but they should be. A model that fails because of deployment problems still fails.
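A small startup check can catch library-version mismatches before they become silent failures. Here’s a sketch, assuming you saved pinned versions at training time to a file such as model_requirements.json (the file name and contents are assumptions).

```python
# Minimal sketch: fail fast at startup if the serving environment's package
# versions differ from those recorded at training time. File name is illustrative.
import json
from importlib.metadata import PackageNotFoundError, version

def check_environment(lockfile: str = "model_requirements.json") -> None:
    with open(lockfile) as f:
        expected = json.load(f)  # e.g. {"scikit-learn": "1.4.2", "numpy": "1.26.4"}

    mismatches = {}
    for package, pinned in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[package] = {"expected": pinned, "installed": installed}

    if mismatches:
        raise RuntimeError(f"Environment does not match training environment: {mismatches}")
```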
There’s one final piece of deployment governance that often gets missed: user communication. Sometimes, a new model changes what users see or how they experience your system—like a search ranking tweak, a content moderation filter, or a product recommendation update. If that change is noticeable or consequential, you need a plan for how to communicate it. Maybe it’s a policy update. Maybe it’s a notice in the UI. Maybe it’s just a heads-up to your support team so they’re ready to answer questions. The point is: deployment isn’t just a technical handoff. It’s a moment where people are affected, and governance needs to include them in the loop.
Done well, deployment is smooth, measured, and transparent. It gives you confidence that the model is ready. And if something unexpected happens, you’ve already got the systems and processes in place to step in and fix it quickly. That’s what real model governance looks like: not just launching a model, but owning everything that happens after.
Sidebar: Model Governance Issues at OpenAI with GPT-4o
In April 2025, OpenAI's release of an updated GPT-4o model became a textbook case of failed model lifecycle governance. The update, deployed on April 24-25 and rolled back on April 295, had made the model excessively sycophantic to the point where it praised absurd business ideas, delusional text, and even allegedly supported plans to commit terrorism.
Most revealing was OpenAI's admission in their post-mortem: "We also didn't have specific deployment evaluations tracking sycophancy"—despite their own Model Spec explicitly discouraging such behavior6. Expert testers had flagged that the model behavior "felt" slightly off, but OpenAI overrode these concerns based on positive A/B test results7.
The incident exposed multiple governance issues: inadequate validation (no specific sycophancy tests), poor deployment controls (expert concerns overridden by metrics), and a 24-hour rollback process that left millions exposed to a compromised model. OpenAI acknowledged they "focused too much on short-term feedback, and did not fully account for how users' interactions with ChatGPT evolve over time".
As one former OpenAI safety researcher noted, this represents "a continuous pattern at OpenAI, they only test for particular things, not for worrisome things in general". The incident demonstrates how even well-resourced organizations can fail at basic model governance when racing to deploy updates.8
Monitoring, Drift Detection, and Feedback Loops
Once a model is deployed, governance doesn’t stop. It just changes focus. This is where continuous monitoring becomes essential, because performance in production doesn’t stay static. The data evolves, users behave differently, the environment shifts. And if your model keeps making decisions based on old assumptions, things can start to go wrong quickly.
One of the most common issues here is drift. Over time, the data coming into your model starts to diverge from the data it was trained on. Maybe a new customer segment enters the system. Maybe there’s a change in how inputs are captured or processed. Maybe seasonality kicks in or a trend shifts in the market. When that happens, the model’s predictions become less reliable, not because anything broke, but because the ground underneath it moved.
This is exactly the kind of scenario I talked about in the previous article about choosing the right controls for different AI risk9. In that article, I wrote about the idea of selecting layered controls at both design time and run time as Prevention, Detection, and Response. Drift is a perfect use case for all three.
You start with Preventive controls. Careful data selection, strong validation, proper testing across diverse scenarios to minimise the risk of drift taking you by surprise. Then you layer in Detective controls, like statistical monitoring, feature distribution checks, and drift alerts. These are the systems that let you know something’s changing before it shows up in user complaints or business metrics.
But detection isn’t enough on its own. You also need Response controls: clear, pre-defined actions for what to do when drift is detected. That could mean retraining the model with fresh data, routing decisions to a human for review, switching to a fallback model, or even temporarily pausing the use of the system altogether. What matters is that the response is fast, controlled, and proportionate to the risk. I think the right approach going into deployment is to set concrete thresholds: for example, “If error rate increases by more than 10% for any priority segment over a 24-hour window, route to human review until resolved.” Then revise those thresholds over time.
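As a sketch of what one detective control might look like (the per-feature Kolmogorov-Smirnov test and the p-value cut-off are one common, illustrative choice; population stability index or domain-specific checks work just as well):

```python
# Minimal sketch of an input drift check: a two-sample Kolmogorov-Smirnov test
# per feature, compared against a pre-agreed threshold. Purely illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(train_sample: np.ndarray, live_sample: np.ndarray,
                     feature_names: list[str], p_threshold: float = 0.01):
    """Return (feature, statistic) pairs whose live distribution has shifted."""
    drifted = []
    for i, name in enumerate(feature_names):
        statistic, p_value = ks_2samp(train_sample[:, i], live_sample[:, i])
        if p_value < p_threshold:
            drifted.append((name, statistic))
    return drifted

# Response control (illustrative): pre-agree what happens when drift is found.
# if drifted_features(train_X, live_X, feature_names):
#     route_to_human_review_or_fallback_model()   # hypothetical response hook
```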
You also need to decide who’s watching. Governance means assigning responsibility. Who monitors the dashboards? Who gets the alert? Who decides whether to retrain or roll back? I’ve seen elegant accuracy monitoring systems fall into disrepair and eventually irrelevance because no one was clearly accountable for interpreting or acting on what they were showing.
And don’t forget about user feedback. Sometimes drift shows up in the data, but often, it shows up first in the human signals. People start asking questions, or support tickets start coming in with phrases like, “It never used to do this.” That’s just as much a part of your governance signals as anything in your metrics dashboard. One useful lesson I learned at Amazon is that if the data and the anecdote disagree, the data is probably wrong, or it’s just the wrong data. Good governance means you have a way to listen to that too, and a feedback loop to fold it back into model review.
So if you’re thinking about lifecycle governance, this is where it really kicks in. Drift is a risk. And like any risk, it needs a set of controls that span prevention, detection, and response. You’re not just building models. You’re running systems that live, evolve, and sometimes fail.
Managing Model Portfolios and Technical Debt
As your organisation scales its AI efforts, you’ll likely go from managing one or two models to dealing with dozens, or even hundreds, across different teams, products, and business units. When that happens, the focus of governance shifts. It’s no longer just about whether any single model is working properly, it becomes about whether your entire portfolio is under control.
Without governance at this level, technical debt builds up fast. Old models get left running, sometimes for years, without anyone checking if they’re still doing the right thing. Different teams might build near-identical models to solve similar problems without knowing the other exists. Pipelines start getting complicated, and steps get bypassed. Documentation gets skipped. And before long, you’ve got models in production that no one fully understands.
The first step in managing this is to keep an inventory. This ties back to the system inventory approach I described in an earlier article.10 You need a live catalog of all the models in use, ideally, both in production and in active development. Each entry should include who owns it, what it’s for, when it was last updated, and how critical it is from a risk or compliance perspective. That kind of visibility is gold for governance. It allows you to ask questions like: Which high-risk models haven’t been reviewed in over a year? or How many of our production models are touching personal data? Without an inventory, models disappear into the shadows. And that’s where horrible surprises lurk.
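Even a very lightweight inventory format makes those questions answerable. A sketch, with illustrative fields and risk tiers:

```python
# Minimal sketch of a model inventory entry and the kind of portfolio query it
# enables. Fields and risk tiers are illustrative; use whatever taxonomy fits.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ModelRecord:
    name: str
    owner: str
    purpose: str
    risk_tier: str            # e.g. "high", "medium", "low"
    uses_personal_data: bool
    last_reviewed: date
    in_production: bool

def overdue_high_risk(inventory: list[ModelRecord], max_age_days: int = 365):
    """Which high-risk production models haven't been reviewed in over a year?"""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [m for m in inventory
            if m.in_production and m.risk_tier == "high" and m.last_reviewed < cutoff]
```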
I don’t mention commercial tools in my posts, but if you’re working at this scale and this is your problem, you need tools. It’s impossible to cope without them.
Try to standardise wherever you can. That doesn’t mean every team has to use the same tools or workflows, but it does mean agreeing on common patterns. Use a consistent format for model documentation and version tracking. Set a baseline for monitoring and logging. Encourage teams to store models in the same registry or deployment platform. If one team’s model failure leads to an improvement, others should be able to learn from it. A little alignment here goes a long way. Even just agreeing on a few shared tools - say, MLflow for tracking, or a shared CI/CD pipeline - can hugely reduce governance overhead.
You also want to keep an eye on redundancy. As more teams build models, it’s easy to end up duplicating effort. You may have been in an organisation where two departments were unknowingly training near-identical models to solve the same problem. I sure have. One might be better, but no one’s looking at the big picture. This is where a central governance function, like an AI Centre of Excellence, can add value. They can spot duplication, encourage reuse, and guide teams toward building together rather than in silos.
Just as importantly, governance needs to watch for model interactions. Sometimes models are chained together where one produces an output that another consumes. Other times, they’re making decisions in parallel like two credit models feeding into different products. If those models operate under different assumptions, or optimise for conflicting goals, you can end up with unintended consequences. A portfolio-level view lets leadership step in and align those systems with the broader strategy.
In many cases, you’re not just managing individual models. You’re more likely managing systems of models. A single AI product might include multiple models stitched together: a speech recogniser, a natural language processor, a dialogue engine, and a content generator. Each might look fine in isolation, but what happens when you connect them? Does a small misstep in transcription lead to a major mistake in the final output? That’s why governance sometimes needs to extend to orchestration—not just how models work, but how they work together.
Then there’s technical debt—which builds up quietly, and can become a horror show blocker to agility. In machine learning, technical debt often shows up in pipelines that are messy, fragile, or undocumented. Maybe a model relies on a data feed that changes format without warning. Maybe someone hardcoded a workaround that nobody updated when the model was retrained. I’ve seen teams where retraining became such a complex task that no one wanted to touch it. That’s a late signal you’re overdue for refactoring.
Governance should create space to clean up that debt. I sometimes say that good governance creates organisational slack - the space and time to make improvements or deal with risks before they grow into live issues. Every few iterations, pause and tidy up: remove features no longer used, consolidate steps, rewrite the glue code. Just like software refactoring. Make the pipeline easier to understand and maintain, automating what you can and documenting the rest. If someone leaves the team, the model they built shouldn’t become a black box. Investing in decent tooling, whether it’s infrastructure-as-code, model packaging, or just clear scripting, pays for itself quickly when you’re operating at scale. Just don’t think that paying for tools is in any way the same thing as doing governance work.
You also want to periodically ask: do we still need this model? Just because a model is still running doesn’t mean it’s still useful. Business priorities shift. Better models come along. Sometimes, nobody’s even monitoring the thing anymore. If a model’s no longer delivering value, retire it. That ties directly into the model retirement stage we talked about earlier. It’s like spring cleaning for your portfolio—prune the models that aren’t pulling their weight so you can focus governance and monitoring effort where it matters.
Beyond all that, there’s a strategic layer too: alignment with your principles. Every model in the portfolio should meet your organisation’s standards for fairness, transparency, and accountability. If you’ve stated that your AI systems won’t use certain types of profiling, you want to be sure no team is quietly doing it. Individual models might pass review, but at a portfolio level, you can see trends, maybe certain groups are consistently under-served, or one product team keeps pushing models that fall just outside the spirit of your guidelines. A governance body can help spot those patterns and steer things back on course.
Finally, governance should include a clear policy for retraining and lifecycle management. You don’t want models going stale. Decide upfront how often models should be evaluated. Maybe that’s every six months, or whenever drift or performance thresholds are breached. And sometimes, you’ll need to stop patching an old model and build a fresh one. It’s easy to fall into the trap of endlessly trying to update something that was never quite right for the job in the first place. Governance gives you permission to say: we’ve outgrown this, let’s start again, and do it properly.
All of this work on the inventory, the standards, the oversight, the reviews is what transforms model development from a set of scattered experiments into a structured, manageable portfolio. It’s how you scale AI without losing control. And in the end, it’s how you make sure your models don’t just work individually, but that they work together, stay aligned with your goals, and remain safe, fair, and useful over time.
Ok, that’s it for my primer on real world model lifecycle governance. We’ve gone through everything from version control and thorough validation (including fairness and safety checks) to careful deployment with fallback plans, solid monitoring in production, and structured processes for model updates or retirement. With strong model governance, you hugely reduce the chance of unpleasant surprises like model failures, biased outcomes, or mystery horror models in production. Instead, you get a controlled environment where models work well and deliver value, and where we humble humans remain in control of the AI, not the other way around.
So now I’ve covered data governance and model governance in depth. The next step is to formalise these practices into coherent policy. Just like your organisation (I assume) has an IT security policy and a privacy policy, it’s becoming important to have a Model & Data Lifecycle Governance Policy as part of the broader AI governance framework. But that’s a topic for another article in which I’ll try to walk through what such a policy should include, give you a template structure, and discuss how to adapt it to your needs and actually implement it.
Stay tuned for a practical policy blueprint that brings it all together. Thanks for subscribing.
https://edition.cnn.com/2021/05/19/tech/twitter-image-cropping-algorithm-bias/index.html
https://arxiv.org/pdf/2105.08667
https://arxiv.org/pdf/2201.11706
https://dl.acm.org/doi/10.1145/3287560.3287596
https://openai.com/index/sycophancy-in-gpt-4o/
https://model-spec.openai.com/2025-04-11.html
https://techcrunch.com/2025/04/29/openai-explains-why-chatgpt-became-too-sycophantic/
https://thezvi.substack.com/p/gpt-4o-sycophancy-post-mortem