AI Lifecycle Governance
How to embed the mechanisms for safety and trustworthiness into every phase of your AI System’s lifecycle – from concept to operation to retirement.
Most AI failures don’t happen overnight—they brew over time due to unnoticed issues in data or models. Just think about the online real estate company Zillow, which lost over $500 million when its house-price prediction model went off the rails.
In 2021, Zillow—a company known primarily for real estate listings—found itself in the middle of an AI governance cautionary tale (not a fun place to be). At the heart of it was a bold bet on predictive modelling: the company’s Zillow Offers program used machine learning to estimate the value of homes and then purchase, renovate, and resell them for a profit. In theory, this “iBuying” model would allow Zillow to act as both a platform and a market participant, using its AI to profit from precision pricing in the property market. At its peak, Zillow was purchasing thousands of homes a month, confident that its valuation algorithm could navigate the complex, localised fluctuations of the housing market. Investors were impressed - Zillow’s share price more than tripled in just a single year up to February 2021.
But beneath the surface, cracks were forming. Zillow’s AI models—trained on years of historical housing data—began to misfire as market conditions shifted. The pandemic had brought unusual volatility to housing demand and supply, and subtle changes in buyer behaviour and local inventory were not fully captured in the training data. As a result, the model started to overestimate property values. Zillow began buying homes at inflated prices, assuming they could be resold at a margin that never materialised. And because this wasn’t just a recommendation engine, but a system tied directly to large-scale capital deployment, the consequences were real and rapid.
By late 2021, the problem had grown too large to ignore. Zillow was sitting on thousands of properties it had purchased at prices above what the market was willing to pay. The financial losses mounted. In November, the company announced it would shut down the Zillow Offers program entirely and lay off 25% of its workforce. They wrote down more than half a billion dollars in losses from the venture, and the share price gave up all the gains of the prior two years and then some. CEO Rich Barton admitted that the company had overestimated the predictive power of its algorithms, saying: “We’ve determined the unpredictability in forecasting home prices far exceeds what we anticipated.”
The story of Zillow isn’t just about a failed product though—it’s about the risks of treating models as static assets in a dynamic world. The algorithms weren’t malicious or wildly incompetent; the problem was more subtle. The model worked—until it didn’t. And when it began to drift, there wasn’t enough governance infrastructure in place to detect the changes in time or to apply meaningful human oversight. No formal guardrails triggered alerts. No systematic drift detection halted purchases. No risk control thresholds forced a review of assumptions. The model had been deployed as if it could run on autopilot—an expensive lesson in what happens when you operate without robust lifecycle governance.
This is the painful lesson at the core of my next set of articles, put simply: if you don’t govern your models and data throughout their lifecycle, small problems can compound into huge failures.
What is data and model lifecycle governance?
In essence, this is all about embedding oversight and assurance checks into every phase of your AI system’s life—from initial development and deployment to continuous monitoring and eventual retirement. It’s the practical side of AI governance that ensures what you build (the model) and what you build it with (the data) remain trustworthy and aligned with your goals over time. If traditional software has a development lifecycle for code, AI adds two extra cycles running in tandem: a data cycle (where data is collected, refined, and evolves) and a model cycle (where models are trained, evaluated, and updated). This has come to be known as MLOps - the discipline of applying DevOps to machine learning. If you want a broad foundational primer on MLOps, then I highly recommend the content on ml-ops.org,1 but I’m going to focus specifically on the assurance aspects of an MLOps-based practice.
These practices are important because AI systems are never truly “finished.” A model that performs flawlessly in the lab can degrade in production as real-world data drifts away from the training set. Without ongoing monitoring, even well-trained models can “drift” and produce unwanted results— I wrote about drift detection in a previous article as an important type of risk to plan for2. Data isn’t static: new data arrives, source systems change, and data quality can fluctuate. If you’re not governing these changes, you’re essentially flying blind, so instead you need to set up processes to regularly check and re-check that the data and models behave as intended, rather than treating validation as a one-time checkbox.
A mature AI governance implementation should define criteria and mechanisms for each of these cycles—making sure that data meets quality and ethical standards and that models meet performance and safety standards. To get to the how, let me first break down the AI lifecycle and talk through how to embed these governance mechanisms at each step:
1. Data Acquisition & Preparation
Governance starts even before a model is built, by describing and enforcing criteria for acceptable data sources and datasets: Was the data collected legally and ethically? Do we have the right permissions? And so on.
For any dataset to be used in training, you need to capture provenance information (where, when, and how it was collected) and vet the data for issues like personally identifiable information, harmful content, or intellectual property that shouldn’t be there. Clear data quality standards (e.g. accuracy, completeness) need to be met. This is also the stage to document the dataset (e.g. creating a Dataset Card like we discussed in a previous article3 – one that captures its origin, contents, and any limitations) and to flag any biases or gaps early. At this stage you’re basically setting up the mechanisms to treat data with the same rigour as code.
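To make that concrete, here’s a rough sketch of what an automated dataset intake check might look like in Python. The field names, PII patterns, and thresholds are illustrative assumptions rather than a standard; the point is that provenance and quality criteria become executable checks rather than a document nobody reads.

```python
import pandas as pd

# Illustrative intake gate: provenance metadata plus basic quality checks.
REQUIRED_PROVENANCE = {"source", "collected_on", "collection_method", "licence"}
SUSPECT_PII_COLUMNS = {"email", "phone", "date_of_birth", "full_name"}

def intake_check(df: pd.DataFrame, provenance: dict,
                 completeness_threshold: float = 0.98) -> list[str]:
    """Return a list of issues; an empty list means the dataset passes intake."""
    issues = []

    # 1. Provenance: where, when, and how the data was collected.
    missing = REQUIRED_PROVENANCE - provenance.keys()
    if missing:
        issues.append(f"Missing provenance fields: {sorted(missing)}")

    # 2. PII screening (a naive column-name heuristic, for illustration only).
    pii_hits = SUSPECT_PII_COLUMNS & {c.lower() for c in df.columns}
    if pii_hits:
        issues.append(f"Possible PII columns need review: {sorted(pii_hits)}")

    # 3. Completeness: share of non-null cells across the whole dataset.
    completeness = df.notna().mean().mean()
    if completeness < completeness_threshold:
        issues.append(f"Completeness {completeness:.2%} is below threshold")

    return issues
```

A check like this can sit in whatever pipeline registers a dataset for training, and its output can feed straight into the Dataset Card.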
2. Model Design & Development
Then as you build models, you want to set practical guardrails on experimentation. Teams should stick to version control practices not just for code but for training data and models as well. Every model training run should be reproducible—if someone else retrains with the same data and parameters, they should get the same result (at least approximately). Here’s where ethical guidelines start getting factored into design: for example, if certain model behaviours or uses are off-limits, say, using protected attributes like race as inputs, those are clearly communicated and enforced from the start.
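As a sketch of what reproducibility can mean in practice, the snippet below pins the random seeds and fingerprints the training data and hyperparameters, so a run can be identified and repeated later. The function name and parameters are assumptions for illustration, not any particular tool’s API.

```python
import hashlib
import json
import random

import numpy as np

def fingerprint_run(data_path: str, params: dict, seed: int = 42) -> str:
    """Pin randomness and return a stable identifier for this training run."""
    # Pin the sources of randomness we control.
    random.seed(seed)
    np.random.seed(seed)

    # Fingerprint the exact training data file and hyperparameters.
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    digest.update(json.dumps(params, sort_keys=True).encode())
    digest.update(str(seed).encode())

    # Store this alongside the model artefact so anyone can retrace the run.
    return digest.hexdigest()[:16]
```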
All of this is about treating model development as a disciplined process, not a research free-for-all. Some organisations establish peer reviews or checklist-based approvals before a model can leave the “lab” environment, along with a set of initial validations that need to happen. For example, beyond optimising for accuracy, characteristics like fairness and robustness should be tested. These might include checks on how the model performs across user groups, and how it handles edge cases. Any immediate issues uncovered get addressed ‘in the lab’ before proceeding to the engineering process of embedding the model into software for deployment.
3. Validation & Testing
Before a model hits production, you want to have mechanisms for rigorous testing and documentation. This often means evaluating the model on a validation dataset that was kept separate from training data to get an unbiased performance estimate. It could also mean performing stress tests and adversarial tests – for example, checking if a slight change in input causes wild swings in output, or if the model might output any disallowed content.
This is much more than just testing for raw performance or accuracy; we’re testing here for ethical and safety criteria too. Does the model output reflect any bias? If it’s a credit decision model, are we sure it’s not unintentionally redlining by neighbourhood as a proxy for race? A model should only “graduate” from this phase after passing all these checks.
Many teams use a governance review or approval gate here: a formal sign-off that the model meets the organisation’s standards. It can be the best point to insist on documenting results in a Model Card, which captures the model’s intended use, performance across various conditions, and known limitations. By having this thorough validation step, we set a high bar the model must clear to be deployed.
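As a hedged sketch of what that approval gate can look like in code: evaluate the candidate model per user segment, compare against thresholds you agreed in advance, and keep the results for the Model Card. The segment structure and thresholds here are invented for illustration.

```python
from sklearn.metrics import accuracy_score

# Illustrative acceptance thresholds, agreed before validation starts.
THRESHOLDS = {"overall": 0.90, "per_segment_minimum": 0.85}

def validation_gate(model, X_val, y_val, segments) -> dict:
    """Evaluate per segment and return a record suitable for a Model Card.

    `segments` is assumed to be a dict of boolean masks over the validation
    set, e.g. {"new_customers": mask1, "existing_customers": mask2}.
    """
    preds = model.predict(X_val)
    report = {"overall_accuracy": accuracy_score(y_val, preds), "segments": {}}

    for name, mask in segments.items():
        report["segments"][name] = accuracy_score(y_val[mask], preds[mask])

    report["passed"] = (
        report["overall_accuracy"] >= THRESHOLDS["overall"]
        and all(a >= THRESHOLDS["per_segment_minimum"]
                for a in report["segments"].values())
    )
    return report
```

The "passed" flag isn’t the sign-off itself; it’s the evidence the reviewers look at when they give it.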
4. Deployment & Release
The deployment itself needs to be governed—and this is where change management really matters. You don’t just toss a new model into production and hope for the best. Instead, you need a controlled process that gives you visibility, traceability, and a safe way to back out if something goes wrong. For critical systems, that might mean releasing the model to a small percentage of users or into a limited beta environment first. This is often called a canary deployment, and it lets you observe how the model performs in the real world before it has full impact. But it’s not just about watching metrics—you also need a clear rollback plan. If things start looking off, you want to be able to revert quickly to a previous, known-good version without scrambling to rebuild what came before.
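As an illustration of the routing side, here’s a minimal canary sketch: a deterministic hash of the user ID sends a small, configurable share of traffic to the candidate model, and dropping that share back to zero is your rollback. Everything here, including the model handles, is hypothetical.

```python
import hashlib

CANARY_PERCENT = 5  # start small; raise gradually as confidence grows

def pick_model(user_id: str, stable_model, canary_model):
    """Deterministically route a small share of users to the canary model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < CANARY_PERCENT:
        return canary_model   # the same users stay on the canary between requests
    return stable_model       # rollback = set CANARY_PERCENT back to 0
```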
But here’s the important part that often gets missed: what you’re deploying isn’t just a model. The thing going into production is a full AI system. That includes the model artefact, yes—but also the code that wraps around it, the interfaces that deliver outputs to users, the data pipelines feeding inputs, the background agents acting on decisions, the infrastructure serving it all, and the networks it runs across. And layered on top of that are your users—whether internal teams or external customers—who rely on the system to behave in a certain way. All these components are interdependent. A seemingly small change in one part—a new feature in the UI, an updated input field, a tweak to the model pre-processing—can cascade through the system and cause unintended consequences elsewhere.
That’s why deployment governance needs to treat the whole thing, not just the model, as the unit of change. You might need to coordinate updates to the data pipeline, review changes to how results are presented, and validate that any agents or downstream systems can interpret the model’s outputs correctly. A model might pass all the tests, but a change in the output format could break an integration downstream and silently cause bad decisions. These kinds of failures don’t come from the model—they come from treating deployment as a technical handoff of components instead of a system-level event.
For higher-impact deployments, I always recommend building in an explicit approval step. A simple go/no-go review—where the technical lead walks through the validation results, the risk profile, and the deployment plan—can make all the difference. I’ve worked with teams who brought in their ops lead, their product owner, even legal or compliance for these reviews, depending on the context. It’s not about adding bureaucracy. It’s about making sure everyone understands what’s being deployed, what could go wrong, and what safeguards are in place. And just as importantly, you want to document it: what model version you’re releasing, what data it was trained on, what environment it’s running in, who approved it, and when.
The key is traceability. Everything going into production—model, data, code, decisions—needs to be versioned and traceable, not just so you can troubleshoot, but so you can learn, improve, and if needed, defend your choices. This is what turns deployment from a risky push into a confident, managed release. You never want to find yourself in a situation of simply crossing your fingers and hoping that a live production deployment works (though in truth, it’s only human to be nervous).
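To make the traceability point concrete, this is the sort of release record I have in mind: a handful of fields captured at the moment of deployment and stored with the release. The field names and values are invented for illustration.

```python
from datetime import datetime, timezone

# Captured at release time and stored alongside the deployed artefacts.
release_record = {
    "model_version": "credit-scorer v2.4.1",
    "training_data": "applications_2019_2024, snapshot 2024-05-31",
    "validation_report": "reports/v2.4.1-validation.pdf",
    "environment": "prod-eu-west, inference service v1.9",
    "approved_by": ["technical lead", "product owner", "compliance"],
    "approved_at": datetime.now(timezone.utc).isoformat(),
    "rollback_target": "credit-scorer v2.3.7",  # the known-good version
}
```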
5. Monitoring & Maintenance
Once you’ve deployed your system, kick back, relax. Take a break.
As if! In reality, this is where the real work begins. Post-deployment is when continuous assurance kicks in—because it’s not enough to know the model worked once, in testing. Now you need to make sure it keeps working, in the wild.
You’ll want to make sure your real-time monitoring is working, tracking how the model is performing. Is the accuracy holding steady? Are error rates creeping up? Are there new types of errors starting to show up more often? These are early warning signs that something might be going wrong. At the same time, keep an eye on the data flowing into the model. If the input data starts drifting—say, your recommendation system suddenly gets a wave of customers from a new region or demographic that wasn’t well represented in the training data—that’s a red flag.
This is where drift detection becomes essential, something like a smoke alarm for your AI system. It watches for changes in data patterns or model behaviour and alerts you when something slips outside the bounds of what you expected. If you catch it early, you can retrain or recalibrate the model before it turns into a fire. You’ll want to define thresholds—something like, “if accuracy drops below 85% on any critical user segment, trigger an alert and pause automated decisions until we’ve had a look.”
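Here’s a minimal sketch of that smoke alarm, assuming you keep a reference sample of a feature from training time and compare it against a recent window of live inputs. The two-sample Kolmogorov-Smirnov test is one common, simple choice for numeric features; the thresholds are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01    # below this, the distributions look genuinely different
ACCURACY_FLOOR = 0.85   # the illustrative threshold from the example above

def check_for_drift(reference: np.ndarray, live_window: np.ndarray) -> bool:
    """Flag drift when live inputs no longer look like the training data."""
    _statistic, p_value = ks_2samp(reference, live_window)
    return p_value < DRIFT_P_VALUE

def segments_below_floor(segment_accuracies: dict) -> list[str]:
    """Return the user segments that have slipped below the agreed floor."""
    return [s for s, acc in segment_accuracies.items() if acc < ACCURACY_FLOOR]

# Typical wiring: run both checks on a schedule; a hit raises an alert and
# pauses automated decisions until someone has reviewed what changed.
```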
But monitoring isn’t just about numbers. It’s also about listening to people. Users will often spot things before your metrics do—strange behaviours, unfair outputs, decisions that don’t feel quite right. If someone flags something odd, take it seriously. Log the issue, review the data, and investigate.
The point is, model governance doesn’t stop at deployment. It becomes a loop—data leads to model, model leads to deployment, and deployment feeds right back into new data and insights. When you build that loop intentionally, you get continuous learning, faster recovery from issues, and better long-term performance. That’s the difference between building something that works today and building something that keeps working tomorrow.
6. Retraining & Updates
Governing the lifecycle means planning for change—not reacting to it after things go sideways. Models don’t stay fresh forever. Over time, they get stale, their performance drifts, or the world around them simply changes. When that happens, it’s time to retrain. But not in a rushed, last-minute scramble. Retraining should be treated as a deliberate, governed process—almost like a mini-project in its own right.
You start by gathering new training data, and just like before, it needs to meet the same quality and governance standards you originally set. Then you run the training process in a controlled environment, test the new model, and compare it directly to what’s currently in production. That might involve side-by-side A/B testing, or running the new model in shadow mode to see how it performs on real traffic without impacting decisions. Only once the new version proves itself should it be considered for rollout.
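Shadow mode is simpler than it sounds, so it’s worth sketching: the challenger sees real traffic and its predictions are logged for comparison, but only the champion’s output is ever acted on. The model and logging interfaces below are hypothetical.

```python
import logging

logger = logging.getLogger("shadow_eval")

def serve_prediction(features, champion, challenger):
    """Serve the champion; run the challenger silently for later comparison."""
    decision = champion.predict(features)

    try:
        shadow = challenger.predict(features)
        # Log both so offline analysis can compare them on identical inputs.
        logger.info("champion=%s challenger=%s", decision, shadow)
    except Exception:
        # A failing challenger must never affect the live decision.
        logger.exception("challenger failed in shadow mode")

    return decision  # only the champion's output reaches users
```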
Some organisations do this on a schedule—retraining monthly or quarterly. Others take a more event-driven approach, triggering a retrain when they detect drift or see performance drop below a certain threshold. Others retrain when new foundational capabilities or features become available that they want to use. Whichever approach you take, the key is not to let outdated models quietly keep running long past their use-by date. The risk there isn’t just lower performance—it’s losing trust, introducing bias, or making decisions that are no longer aligned with reality.
But here’s something that’s easy to miss: just because a model is new doesn’t mean it’s automatically safer. You need to be especially careful that risks you’d already addressed in the previous version—biases, edge cases, odd failure modes—don’t quietly resurface in a different form. I’ve seen this happen when a new model, trained on fresher data, accidentally reintroduces a previously mitigated fairness issue because the underlying patterns changed slightly. Or a new architecture behaves beautifully most of the time but produces strange outputs in edge scenarios that never showed up before. That’s why your validation process needs to include regression testing—not just checking that the new model is better overall, but confirming it hasn’t reintroduced old problems or introduced new ones.
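One way to make "old problems stay fixed" concrete is to keep every previously mitigated issue as a named regression case and run the suite against each candidate before it goes anywhere near production. The checks, thresholds, and model interface below are invented for illustration.

```python
def fairness_gap(model, X_groups: dict) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [model.predict(X).mean() for X in X_groups.values()]
    return max(rates) - min(rates)

def regression_suite(candidate, X_groups: dict, edge_cases: dict) -> list[str]:
    """Return the names of previously fixed issues the candidate reintroduces."""
    failures = []

    # Case 1: a fairness gap mitigated in an earlier version must stay closed.
    if fairness_gap(candidate, X_groups) >= 0.05:
        failures.append("fairness_gap_reopened")

    # Case 2: edge inputs that once produced out-of-range scores must stay sane
    # (assuming, for illustration, that valid scores lie between 0 and 1).
    for name, X in edge_cases.items():
        preds = candidate.predict(X)
        if preds.min() < 0 or preds.max() > 1:
            failures.append(f"edge_case_regressed:{name}")

    return failures
```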
Real-world cases show why rigorous evaluation is non-negotiable. In 2023, OpenAI faced backlash after users reported that GPT-4’s performance had degraded in tasks like coding and logic compared to earlier versions—a claim supported by a study from Stanford and UC Berkeley that found significant variability in output quality over time4. Similarly, when Google DeepMind updated Gemini to include new capabilities, early testers flagged regressions in reasoning coherence and factual accuracy, prompting Google to delay full deployment until improvements were made. These incidents underscore that even top-tier AI labs aren’t immune to regressions—making thorough pre-release testing and monitoring essential to avoid eroding user trust.
This part of the lifecycle is also where technical debt starts to creep in. As you iterate, you’ll accumulate old model versions, unused data features, half-deprecated pipelines, and quick fixes made under pressure. Without governance, this all piles up and eventually slows you down. That’s why pruning matters. Part of lifecycle management is knowing when to decommission models that are no longer serving a purpose, clean up what’s no longer used, and refactor parts of the pipeline to keep things lean and maintainable. Machine learning systems are notorious for building up technical debt—governance is how you keep it under control, before it buries you.
7. Retirement & Sunsetting
At some point, every AI system or model reaches the end of its useful life. And just like the rest of the lifecycle, that final step needs to be governed too. You don’t just switch it off and move on—you retire it deliberately, with a plan.
That might mean transitioning to a newer model, making sure there’s enough overlap so you don’t leave any gaps in functionality. Or it could mean phasing out a feature entirely. Either way, you need to handle the shutdown carefully. That includes dealing with the data, archiving the historical inputs and outputs, then securely deleting anything sensitive in line with your retention policy. And just because the model’s no longer in use doesn’t mean the records disappear. You still need the logs, the model version, and the data it was trained on—so if someone asks a year or two later, “why did the system make this decision?”, you can go back and find the answer.
Retiring a model properly also prevents one of the biggest long-term drags: orphaned systems. These are models or pipelines that technically still live somewhere, but nobody knows who owns them, what they do, or whether they’re safe. A clean retirement process avoids that. It ensures you’ve tied up loose ends—like removing the model from the system inventory, disabling any integrations that relied on it, and updating user-facing tools or documentation.
And then there’s the data. Because once the model’s gone, the question becomes: what happens to everything it used or produced? If that includes sensitive or personal information, you need a plan—whether that means archiving it securely or deleting it entirely. This isn’t just good practice—it’s often a regulatory compliance requirement. In some cases, you may need to prove that the data’s been properly destroyed.
Otherwise, you end up in murky territory. Just ask 23andMe—or more specifically, ask their customers—what happens to personal data when a company disappears or sells off its assets. Genetic data, it turns out, doesn’t just vanish with a bankruptcy filing5. If you're not crystal clear about ownership, consent, and what happens to data after a system is shut down, you risk leaving behind a privacy mess no one wants to inherit.
Retirement might not be the most exciting part of the lifecycle—but it’s one of the most revealing. It shows whether you’ve really built your systems with integrity, or whether they quietly outlived their purpose without anyone noticing.
Wrapping up and What’s Next
When you look at the lifecycle as a whole, it becomes pretty clear: model and data governance isn’t something you do once and tick off. It’s a thread that runs through everything—from first ideas to final decommissioning. It touches everyone involved in building AI: data engineers, scientists, ML engineers, product owners, compliance leads. And if it’s not coordinated across all of them, things fall through the cracks—fast.
One of the most practical ways I’ve found to stay on top of it is to keep a proper AI system inventory. If you’ve read the earlier articles, you’ll know I’m a big believer in thorough inventories. Just list out every AI model, dataset, or interface you have—along with things like ownership, when it was last updated, its current risk profile, and when it’s next due for review. It doesn’t have to be fancy. Even a simple spreadsheet is better than nothing. Because when you know what you’ve got, you can actually govern it. When you don’t, you’re just guessing. Without an inventory, your teams will trip over forgotten models still running in production, or datasets being reused in ways no one signed off on. An inventory makes that visible—and once it’s visible, it’s manageable.
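It really can be that simple. Here’s a sketch of the sort of fields I mean, with invented values; a spreadsheet with the same columns does the job just as well.

```python
from datetime import date

# One record per AI system; every field answers a question you will
# eventually be asked: who owns this, what is it, when was it last reviewed?
inventory_entry = {
    "system": "house-price-estimator",
    "owner": "pricing-analytics team",
    "model_version": "v3.2",
    "training_data": "sales_2018_2024, snapshot 2024-06-30",
    "risk_tier": "high",              # drives how often it gets reviewed
    "last_updated": date(2024, 7, 1),
    "next_review_due": date(2024, 10, 1),
    "status": "in production",        # or "shadow", "retired", "orphaned"
}
```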
It’s also worth saying plainly: lifecycle governance and risk management are two sides of the same coin. Your risk framework sets the direction and creates a mechanism for oversight—what you care about, what you want to prevent. But lifecycle governance is how you actually do it, day by day, decision by decision. It’s where the risks get spotted, tracked, and dealt with before they become problems. Without it, you’re flying blind. You won’t catch a model slowly drifting into bias. You won’t notice that data is being reused inappropriately. And worst of all, if something does go wrong, you’ll have no audit trail, no documentation, and no way to explain how you got there. That’s not a position you want to be in—especially if regulators, users, or leadership come asking.
So here’s how I think about it: lifecycle governance is how you bake quality and accountability into the way your AI is built. It’s the difference between hoping your model behaves and knowing you’ve done the work to make sure it will. Yes, that means setting up the right processes—checklists, sign-offs, version control, all the usual suspects. Yes, it means using tools to help—automated testing, data versioning, drift detection, whatever makes your life easier. But more than anything, it means building a culture where teams actually care about doing this properly. Where high-integrity governance isn’t seen as a roadblock, but as a foundation for building AI that works—and keeps on working.
In the next few parts of this series, I’ll dig deeper into both sides of that lifecycle: first, how to manage your AI data assets with rigour, and then how to govern your models so they stay trustworthy over time. I’ll finish with a full AI Model & Data Lifecycle Policy—a practical template you can use to put these ideas into action. I’m hoping by the end, you’ll have a clear, real-world roadmap for managing the two most important assets in your AI stack: your data and your models.
Do this right, and you still move fast—but contrary to the regrettable Silicon Valley mantra, you’ll be moving fast and not breaking things.
As always, thanks for reading. I’ve come to really enjoy researching and writing these articles. I’m very glad if you find them useful.
https://ml-ops.org/#gettingstarted
https://arxiv.org/pdf/2307.09009
https://www.npr.org/2025/03/24/nx-s1-5338622/23andme-bankruptcy-genetic-data-privacy