Diving into AI Data Lifecycle Governance
Effective practices for data quality, lineage, access control and bias in the data of AI Systems
One of the most infamous early lessons in data governance came from my former employer, Amazon, and their AI recruitment experiment. Back in about 2014, Amazon built a machine learning system to help screen resumes, intended to surface top candidates faster and reduce the manual load on recruiters. On paper, it sounded like the perfect use case. But the system quietly developed a problem. It started penalising resumes that included the word “women”, as in “women’s chess club captain” or “women’s coding group”. Why? Because the training data, which comprised the resumes of successful past hires, was overwhelmingly male. The model wasn’t intentionally biased. It was biased by history. It learned from the past, and the past reflected inequality.1
Amazon caught the issue before the tool was rolled out widely, but the story eventually went public, and it became a kind of parable in AI Safety, quoted everywhere from AI ethics courses to regulatory workshops. It was one of the first stories that came up when I joined Amazon Web Services five years ago. At the time, I was leading the team that worked with regulators in more than 25 countries, building our understanding of security, resilience and privacy requirements and translating them into the engineering and governance approaches for how we built and governed AWS systems. It’s one of the reasons why I use two hypothetical AI systems, TalentMatch and Pathfinder, in my writing and training courses, and it’s a cautionary tale that still has great relevance today.
The truth is the model itself didn’t fail. The dataset did. The phrase “garbage in, garbage out” isn’t just about messy data—it’s also about blind spots, gaps, and unexamined assumptions baked into a dataset that magnify when the data is used for model training. It might be better to say “bias in, bias amplified”, though I admit, it’s not as catchy!
Data is not just a technical input to AI Systems; it’s a governance domain in its own right. It needs its own controls, its own documentation, and its own way of being audited, challenged, and improved. That’s what I’m going to focus on in this article: diving deeper into practical data governance, building towards policies and practices you can put in place in your own organisation to manage data quality, access control, documentation, retention, and bias right across the AI lifecycle. They include:
Quality and provenance checks that verify accuracy, source legitimacy, and legal rights to use the data
Fairness and representativeness analyses, such as demographic parity and equal opportunity metrics, that expose structural bias early
Good documentation, including dataset cards, datasheets, and lineage graphs so future teams can audit or reproduce behaviours
Versioning and traceability that tie each model build to the exact snapshot of data it learned from
Retention and deletion rules aligned with privacy law and your model-refresh schedule, because stale data becomes risky data.
The truth most data scientists seem to learn through hard-won experience is that if your data isn’t managed well, it doesn’t matter how good your model is. It won’t do what you expect, and worse, it might quietly do harm before anyone realises.
Data governance in an AI context means treating data as a critical asset that has to be curated and protected throughout its life. This starts from the moment you consider using a dataset. Whether it’s data you collected internally, bought from a third party, or generated synthetically, you need to ensure it meets your standards and aligns with your system’s goals. It’s not just a one-time vetting either; datasets evolve and so must your governance of them. Let’s start to break down the key aspects.
Ensuring Data Quality at the Source
In October 2020 England’s COVID-19 contact-tracing system under-reported 15,841 positive test results. The root cause was embarrassingly simple: labs delivered daily results as CSV files, but the intake script converted them into an old .XLS spreadsheet that tops out at 65,536 rows. Once that cap was hit, new rows were silently dropped. Contact-tracing calls never went out, and thousands of exposed people kept moving through the community unaware.2
I was living in England at the time, like everyone watching the numbers on the news, and I remember the whiplash when officials admitted the clumsy “technical glitch.” It wasn’t a glitch at all; it was a schema change no one had sanity-checked. A single file-format mismatch had triggered a national public-health incident. Episodes like that make one lesson painfully clear: data quality at source is the first safety gate in AI. If you don’t know exactly what enters the pipeline, including its accuracy, completeness, and fitness for purpose, then every downstream control is a gamble.
Before any feature engineering starts, the dataset should be interrogated with three core questions in mind, covering accuracy, completeness, and representativeness.
Is the data accurate? Look for mis-keyed IDs, botched unit conversions, rogue time-stamps, anything that turns facts into fiction. A payment field logged in cents one month and dollars the next will teach a fraud model catastrophic “rules” that no amount of hyper-parameter tuning can fix.
Is the data complete? Missingness is rarely random; there are usually reasons, and they distort reality. If parental-leave status is recorded as periods of extended leave almost exclusively for women, and a promotion-prediction model takes the frequency of leave as a feature (an issue in itself), then it might silently equate motherhood with lower performance. Systematic gaps become spurious correlations that the model will faithfully amplify.
Is the data representative? Check demographic balance, geographic spread, profession and whatever else defines the real-world context. A dermatology image set that is predominantly light-skinned will never diagnose darker skin tones reliably3; a global e-commerce model trained solely on U.S. purchase histories will stumble on Singles Day traffic from Asia. Without coverage, the model’s confidence becomes misplaced certainty.
Answer those three questions honestly and you defuse half the landmines before training even begins. Manual audits don’t scale, though. Instead, treat validation as unit tests for data. For example, configure Great Expectations or Deequ to profile each batch for out-of-range values, unexpected nulls or drift, then enforce schema contracts that reject any new category or flag batches where more than a specified percentage of labels are missing.
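If you want to see the shape of such checks without committing to a particular tool, here is a minimal hand-rolled sketch of a batch validation gate, assuming a simple tabular batch and an agreed contract. The column names, ranges, and thresholds are illustrative; in practice Great Expectations or Deequ give you this, plus profiling and reporting, with far less code.

```python
import pandas as pd

# Illustrative schema contract: expected columns and dtypes, value ranges,
# permitted categories, and a tolerated share of missing labels.
CONTRACT = {
    "columns": {"txn_id": "int64", "amount": "float64", "currency": "object", "label": "object"},
    "ranges": {"amount": (0.0, 100_000.0)},
    "categories": {"currency": {"GBP", "USD", "EUR"}},
    "max_missing_label_pct": 0.02,
}

def validate_batch(df: pd.DataFrame, contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations; an empty list means the batch passes."""
    errors = []

    # 1. Schema check: missing, unexpected, or wrongly-typed columns break the contract.
    expected = contract["columns"]
    if set(df.columns) != set(expected):
        errors.append(f"schema mismatch: got {sorted(df.columns)}, expected {sorted(expected)}")
        return errors  # value checks are meaningless on the wrong schema
    for col, dtype in expected.items():
        if str(df[col].dtype) != dtype:
            errors.append(f"{col} has dtype {df[col].dtype}, expected {dtype}")

    # 2. Range check: out-of-range values often signal unit or keying errors.
    for col, (lo, hi) in contract["ranges"].items():
        bad = df[(df[col] < lo) | (df[col] > hi)]
        if not bad.empty:
            errors.append(f"{len(bad)} rows have {col} outside [{lo}, {hi}]")

    # 3. Category check: reject any value not in the agreed vocabulary.
    for col, allowed in contract["categories"].items():
        unexpected = set(df[col].dropna().unique()) - allowed
        if unexpected:
            errors.append(f"unexpected {col} values: {sorted(unexpected)}")

    # 4. Missing-label check: flag batches over the tolerated null rate.
    missing_pct = df["label"].isna().mean()
    if missing_pct > contract["max_missing_label_pct"]:
        errors.append(f"{missing_pct:.1%} of labels missing (limit {contract['max_missing_label_pct']:.0%})")

    return errors
```

Wire a gate like this into the ingestion job so that a failing batch is quarantined and flagged rather than silently appended, which is exactly the failure mode behind the 65,536-row incident.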
Clean data isn’t always useful data. Every dataset is born in a context, and that context travels with it like an embedded assumption. A perfectly curated US credit-history file may be useless, or even dangerous, when dropped into a European lending model governed by different credit rules and consumer rights. If the source and scope are murky, you need to consider either enriching the metadata or discarding the dataset. Extra or miscalculated columns invite spurious correlations that a model will dutifully exploit.
A well-known illustration of spurious “signal” comes from JPMorgan’s 2012 London Whale debacle. Analysts later discovered that a key Value-at-Risk spreadsheet, used to set trading limits on the bank’s credit-derivatives portfolio, contained a manual formula that divided by the sum of two daily risk numbers instead of their average. The error halved the model’s risk estimate, green-lighting positions that eventually lost more than $6 billion. Stress-testing or hyper-parameter tuning could not have prevented the failure. The flaw was not in the code but in the data itself.4
Even perfectly accurate data is unusable if collected without proper consent. When H&M reused years of employee call-centre notes for analytics, regulators imposed a €35.3 million GDPR fine because the original purpose did not cover large-scale profiling. That case shows that once you change a dataset’s use case, you must re-verify its rights.5
You might have the perfect dataset for the problem. But if it was scraped without permission, includes personal data without consent, or violates someone’s rights, it’s not usable. Every dataset you bring in should come with a quick checklist. Can we use it commercially? Can we share it? Does it contain personal information, and if so, do we have a lawful basis for processing it? Was it collected in a way that respects the people it came from?
This is especially important with third-party data sources. Just because you’re buying a dataset, or accessing one from a “reputable” source, that doesn’t mean it’s clean. Don’t assume they’ve already done the hard work. You need to check for yourself. Check whether the dataset contains undisclosed personal data, offensive or harmful content, or copyrighted material scraped without authorisation. Failure to do so can expose you to infringement claims and unpredictable model behaviour. These things have a way of surfacing at the worst possible time, often long after the model is in production.
Open-source does not mean risk-free. In 2022, investigators found that LAION-5B, a popular training dataset for generative image models, included copyrighted art and private medical photos scraped without consent.6 Teams who assumed Creative-Commons licensing eliminated the risk had to purge pretrained models and retrain from scratch.
Documentation and Lineage: Telling the Data’s Story
When an AI system is criticised for bias or drift, the first question is usually, “What data did it learn from?”, and the second is, “How has that data changed since it was sourced?” You can only answer if the dataset’s history is written down. Without documentation, teams rely on half-remembered Slack threads or the lone engineer who “just knows.”
A practical fix is to treat each significant dataset like a component with its own spec sheet. Timnit Gebru and colleagues formalised the idea in their “datasheets for datasets” proposal: a one-pager that records the source, collection method, contents, intended uses, known limits and legal status of the data7. A well-filled datasheet lets future teams see, at a glance, whether the dataset is still a safe fit. If a voice model starts mis-transcribing speakers from Dublin, you can check the sheet and discover the training recordings came almost entirely from London call-centres, a clue that saves days of blind debugging.
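There is no single mandated format for a datasheet, but even a lightweight, structured record captures most of what future teams need. Here is a minimal sketch of the kind of fields one might hold, expressed as a Python dataclass; the field names and the example values (including the London/Dublin skew from the voice-model scenario above) are illustrative rather than any standard.

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """Minimal spec sheet for a dataset, loosely inspired by 'datasheets for datasets'."""
    name: str
    version: str
    source: str                    # where the data came from (system, vendor, scrape)
    collection_method: str         # how and when it was gathered
    contents: str                  # what a record represents, key fields, time span
    intended_uses: list[str]       # the use cases the data was vetted for
    known_limitations: list[str]   # gaps, skews, and exclusions future teams must know about
    legal_basis: str               # licence, consent, or contractual basis for use
    personal_data: bool            # does it contain personal or sensitive data?
    steward: str                   # who to ask before reusing or extending it
    change_log: list[str] = field(default_factory=list)

voice_corpus = Datasheet(
    name="call-centre-transcripts",
    version="2024-03",
    source="Internal UK call-centre recordings",
    collection_method="Automated transcription of consented support calls, 2021-2023",
    contents="1.2M utterances with speaker region and call-outcome labels",
    intended_uses=["speech-to-text fine-tuning", "intent classification"],
    known_limitations=["Speakers recorded almost entirely in London; Dublin accents underrepresented"],
    legal_basis="Customer consent under support terms",
    personal_data=True,
    steward="data-governance@yourcompany.example",   # hypothetical contact
)
```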
The content of data is only half the story. Lineage shows where the data has travelled: when the raw logs were pulled, the script that stripped personal identifiers, the balancing step that down-sampled the majority class, the final split that fed into the model. Good lineage captures cause-and-effect; when a bug emerges in one of those steps, you immediately know which models and dashboards to revisit.
Beyond preventing errors, documentation also drives reproducibility and knowledge sharing. When one data scientist leaves, their documented datasets and pipelines remain for the next person to understand and build upon, reducing institutional knowledge loss. It’s helpful to integrate documentation steps into the lifecycle: for instance, during the plan and acquire phase, decide what metadata to capture; during the processing phase, document any transformations or feature engineering applied; and when data is archived, include notes on why and how it was archived. Many organisations adopt data version control tools to centrally store dataset documentation, lineage, and usage logs, so anyone can discover and learn about available data. The key is to treat documentation as an ongoing part of data’s life, not a one-time chore.
Sidebar – Data Version Control Tools & Practices:
Keeping track of data versions is a practical aspect of documentation and governance. It ensures you know which version of a dataset was used for a given analysis or model, and lets you reproduce results or roll back to earlier data if needed. A number of non-commercial tools have emerged to help with this:
DVC (Data Version Control)8: An open-source tool that brings Git-like versioning to data science. DVC lets you snapshot and version large datasets and models, storing pointers (metafiles) in Git while the actual data can live in cloud or on-prem storage. This creates a single, traversable history of data, code, and models, so you can identify exactly which data set and parameters produced each result. Switching between dataset versions becomes as easy as checking out a Git branch, enabling reproducible experiments.
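As a sketch of what this looks like day to day: once a dataset has been added to DVC and the small pointer file committed to Git, the DVC Python API can stream the exact version of the data tied to any Git revision. The repository URL, file path, and tags below are hypothetical.

```python
import dvc.api
import pandas as pd

# Load the training snapshot exactly as it existed at Git tag "model-v1.2".
with dvc.api.open(
    "data/training/customers.csv",
    repo="https://github.com/yourorg/credit-model",  # hypothetical repository
    rev="model-v1.2",                                # Git tag, branch, or commit
) as f:
    train_v12 = pd.read_csv(f)

# The same call with rev="model-v1.3" reproduces the next build's inputs,
# so every model version can be traced back to the precise data it learned from.
```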
Delta Lake (Open-Source Usage): An open-source storage layer (originally from Databricks) that enables ACID transactions on data lakes. One of its powerful features is built-in “time travel” queries on your data. Every change (batch update, delete, etc.) is tracked, so you can easily roll back to a previous state or reproduce a past report on the exact snapshot of data used. This is really useful in data governance. If a mistake is introduced in data processing, you can revert to a specific state, and auditors can see how data evolved over time.9
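For illustration, here is a sketch of Delta Lake time travel using the Spark reader. It assumes a Spark session already configured with the Delta Lake extensions, and the table path, version number, and timestamp are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes Spark has been started with the Delta Lake extensions configured.
spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

table_path = "s3://your-lake/curated/transactions"   # illustrative location

# Current state of the table.
current = spark.read.format("delta").load(table_path)

# The table as it looked at an earlier version number...
v42 = spark.read.format("delta").option("versionAsOf", 42).load(table_path)

# ...or as of a point in time, e.g. just before a suspect processing run.
pre_fix = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-06-30 23:59:59")
    .load(table_path)
)

# Diffing snapshots makes it easy to show auditors exactly what changed and when.
added_rows = current.exceptAll(v42)
```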
Keeping such a tidy paper trail is no longer optional hygiene. Article 10 of the EU AI Act requires every high-risk system to prove the quality, relevance and representativeness of the training, validation and test data it uses, and Article 11 demands that technical documentation be kept up to date and available for regulators to inspect10. Similar rules are taking shape in Canada and several U.S. states, so do your future self a favour: a robust lineage graph can turn an anxious compliance scramble into a two-minute export.
And don’t underestimate the power of documentation as a form of control. It forces clarity. If you can’t clearly describe what’s in the dataset and why it’s being used, maybe you shouldn’t be using it. The very act of writing a Dataset Card can make you stop and realise that the data doesn’t actually fit the use case: it was collected for a different purpose, under different assumptions, and using it would be a governance problem waiting to happen.
Well-documented data shortens every feedback loop: incident response, feature engineering, fairness reviews, even board-level risk reporting. Undocumented data, by contrast, leaves your models with ghosts of hidden assumptions, forgotten filters, silent schema shifts.
Access Control and Data Security
Not everyone inside an organisation, let alone outside it, should have unfettered access to AI data. Effective governance therefore embeds robust, least-privilege controls so that only authorised people or services can read, write, or copy sensitive datasets. A data breach by an outside attacker is always a top concern, but overly permissive sharing built into the AI system itself can be just as damaging. Think of the Cambridge Analytica scandal: a seemingly harmless quiz app drew profile data on up to 87 million Facebook users because API permissions were set far too broadly, ultimately resulting in a settlement that cost $725 million.11
In practice, least privilege means defining clear role-based or attribute-based access rules in your storage systems. For example, production customer records used for model training might be readable only by the data engineering team and a named research team after formal approval. When data domains are especially sensitive (for instance, medical records or financial transactions), you can require additional safeguards such as anonymisation or pseudonymisation, specialised training, signed data-handling agreements, or access only within an isolated secure environment.
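Those rules normally live in your cloud IAM, data warehouse, or catalogue rather than in application code, but a small sketch of an attribute-based check shows the shape of the logic. The roles, sensitivity labels, and policy table here are all illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessRequest:
    user_roles: frozenset[str]   # roles held by the requester
    dataset_sensitivity: str     # e.g. "public", "internal", "sensitive"
    purpose: str                 # declared purpose, e.g. "model-training"
    approved_by_steward: bool    # has the dataset steward signed off?

# Illustrative policy: which roles may read data at each sensitivity level,
# and whether steward approval is additionally required.
POLICY = {
    "public":    {"roles": {"analyst", "data-engineer", "model-author"}, "steward": False},
    "internal":  {"roles": {"data-engineer", "model-author"},            "steward": False},
    "sensitive": {"roles": {"data-engineer"},                            "steward": True},
}

def can_read(req: AccessRequest) -> bool:
    """Grant read access only if the role is allowed and any required approval exists."""
    rule = POLICY.get(req.dataset_sensitivity)
    if rule is None:
        return False  # unknown sensitivity defaults to deny
    if not (req.user_roles & rule["roles"]):
        return False  # least privilege: no matching role, no access
    if rule["steward"] and not req.approved_by_steward:
        return False  # sensitive data also needs steward sign-off
    return True
```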
A lightweight data catalogue or registry provides visibility into who owns each dataset, who has current access rights, and under what conditions those rights were granted. Dataset stewards review new access requests against corporate policies.
Segmentation can further reduce the risk by separating raw, identifying data from modelling environments. In a common pattern, you keep sensitive data in an encrypted “quarantine” bucket and expose only anonymised or aggregated views to data scientists. Encryption at rest and in transit, combined with audit logging of every query and file access, creates a strong barrier against accidental leaks. Unusual activity, such as an excessively large data extract, can trigger automatic alerts for rapid investigation. The 2019 Capital One breach, which exposed data belonging to over 100 million customers due to a misconfigured cloud storage policy, is a stark reminder of the importance of tight configuration and monitoring of data access.12
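The monitoring does not need to be sophisticated to be useful. Here is a sketch that flags extracts far above a user’s normal read volume; the event structure, multiplier, and floor are illustrative, and in practice this logic would sit in your SIEM or data-platform alerting rather than a standalone script.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class AccessEvent:
    user: str
    dataset: str
    rows_read: int
    timestamp: datetime

def flag_unusual_extracts(
    history: list[AccessEvent],
    new_events: list[AccessEvent],
    multiplier: float = 10.0,   # illustrative: 10x a user's normal volume is suspicious
    floor: int = 100_000,       # ignore small queries regardless of history
) -> list[AccessEvent]:
    """Flag new extracts far above each user's historical average read volume."""
    per_user: dict[str, list[int]] = defaultdict(list)
    for e in history:
        per_user[e.user].append(e.rows_read)

    flagged = []
    for e in new_events:
        baseline = mean(per_user[e.user]) if per_user[e.user] else 0
        if e.rows_read > max(floor, multiplier * baseline):
            flagged.append(e)  # route to the security team for rapid investigation
    return flagged
```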
A frequent temptation in AI projects is the “quick local CSV”: an analyst downloads data for convenience and stores it on an unmonitored laptop. To prevent this, provide secure sandbox environments equipped with the necessary tools and libraries so analysts never need to remove data from the controlled system. Consider incorporating data-handling rules into onboarding, for example “do not store dataset X on local machines” and “use only platform Y for model development”, and reinforce them with regular training.
While it’s critical to restrict access, you also can’t let governance become a bottleneck. Pre-approved role bundles (for instance, “feature engineer” or “model author”) streamline common workflows, granting necessary permissions automatically when a project starts and revoking them when it ends.
Finally, access control extends to data retention and disposal. Holding sensitive data indefinitely “just in case” increases breach exposure and may violate privacy regulations. Define retention schedules, such as archiving or deleting training snapshots 90 days after model deployment, and automate secure deletion or cryptographic key shredding. This ensures that once data outlives its purpose, it truly disappears and is no longer an attack surface.
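Automating that schedule is straightforward once snapshots carry a creation date. The sketch below assumes one directory per snapshot under an illustrative path and uses plain deletion where a real system would add secure wiping or key shredding; in object storage you would more likely rely on the provider’s lifecycle rules.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
import shutil

RETENTION = timedelta(days=90)                      # example schedule from above
SNAPSHOT_ROOT = Path("/data/training-snapshots")    # illustrative layout: one dir per snapshot

def purge_expired_snapshots(now: datetime | None = None) -> list[Path]:
    """Delete training snapshots older than the retention window; return what was removed."""
    now = now or datetime.now(timezone.utc)
    removed = []
    for snapshot in SNAPSHOT_ROOT.iterdir():
        if not snapshot.is_dir():
            continue
        created = datetime.fromtimestamp(snapshot.stat().st_mtime, tz=timezone.utc)
        if now - created > RETENTION:
            shutil.rmtree(snapshot)   # secure deletion or key shredding would go here
            removed.append(snapshot)
    return removed
```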
Bias Detection and Mitigation in Datasets
Bias is one of the thorniest problems in AI, and it usually starts with the data. If your dataset is skewed or unbalanced, your model will be too. It’s not enough to assume or hope your data is fair. You need to check. And you need to be able to show your work.
That starts by building bias detection into your data review process. Before you use any dataset for training, look at how well it represents the different groups or segments your model is going to impact. If you're building a lending model, that might mean checking how the data breaks down by race, gender, income, geography. Are certain groups underrepresented? Do the outcomes in the data reflect past discrimination, for instance, some groups being systematically denied loans? Because if those patterns are in the data, the model will learn them unless you intervene.
There are tools out there that can help with this, including open-source libraries from IBM, Google, and others that compute fairness metrics on datasets. But you don’t always need something complex. Even simple stats can tell you a lot. What’s the average outcome for each group? What’s the ratio of positives to negatives? Are there any attributes acting as proxies for sensitive variables?
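Those simple stats take only a few lines of pandas. The sketch below reports each group’s share of the data, its positive-outcome rate, and the gap to the best-treated group, which gives a rough demographic-parity view; the column names and the lending framing are illustrative.

```python
import pandas as pd

def group_outcome_report(df: pd.DataFrame, group_col: str, outcome_col: str) -> pd.DataFrame:
    """Per-group counts, positive-outcome rates, and each group's gap to the best-treated group."""
    report = (
        df.groupby(group_col)[outcome_col]       # outcome_col should be 0/1
        .agg(count="size", positive_rate="mean")
        .sort_values("positive_rate", ascending=False)
    )
    # Demographic-parity style comparison: how far each group's positive rate
    # falls below the highest-rate group.
    report["gap_to_top"] = report["positive_rate"].max() - report["positive_rate"]
    # Representation: each group's share of the dataset.
    report["share_of_data"] = report["count"] / len(df)
    return report

# Example: loans["approved"] is 1/0 and loans["gender"] is the group of interest.
# print(group_outcome_report(loans, "gender", "approved"))
```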
Once you’ve identified bias in the data, the next step is to do something about it. That could mean pre-processing the dataset, possibly balancing it by under-sampling the majority group or over-sampling the minority group. You might generate synthetic examples to help cover gaps, or apply re-weighting techniques so the model pays more attention to rare cases. Sometimes you can fill the gaps by collecting more data, although that’s not always realistic. The point is, there’s no single right way to fix bias. And whatever approach you take, document it. If you under-sample or filter the data, make a note in the Dataset Card. If you removed a field because it was too tightly correlated with a protected attribute, say so. Transparency here builds trust and helps everyone down the line understand the trade-offs if they need to be explained after the fact.
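Of those mitigations, re-weighting is one of the easiest to hand-roll and to document. The sketch below assigns each (group, label) combination a weight so that group and outcome are statistically independent in the weighted data, similar in spirit to the reweighing technique in IBM’s AIF360 toolkit; the column names are illustrative.

```python
import pandas as pd

def reweighing_weights(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Weight = expected frequency under independence / observed frequency, per (group, label) cell."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n

    def weight(row):
        g, y = row[group_col], row[label_col]
        return (p_group[g] * p_label[y]) / p_joint[(g, y)]

    return df.apply(weight, axis=1)

# Most scikit-learn estimators accept the result as sample_weight, e.g.:
# model.fit(X, y, sample_weight=reweighing_weights(train, "gender", "hired"))
```

Whichever technique you choose, record the weights or sampling ratios in the Dataset Card so the trade-off stays visible later.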
Another part of this is setting clear thresholds for what’s acceptable. In a perfect world, there’d be zero disparity between groups. But in the real world, you often need to define what level of bias you’re willing to accept, and what crosses the line. Maybe you decide that each key group in your user base should be represented in the training data at no less than 80% of their prevalence in the real population. Or that any performance gap greater than 5% between groups needs to be flagged and investigated. Whatever you decide, document it and treat it as part of your risk framework.
You can’t rely solely on automated checks. Tools are great, but they don’t replace judgment, and what appears unbiased from a purely technical perspective can still be unfair once the social context is considered. Involving domain experts, especially those with lived experience or a deeper understanding of systemic bias, can surface issues that metrics alone won’t catch. Some biases are subtle, historical, or cultural. You won’t always spot them in the numbers, but different perspectives can bring them into focus.
Now, what if you find bias in your data but can’t realistically fix it? Maybe you don’t have access to better data, or collecting it would take months you don’t have. That’s where governance needs to step in with constraints. If your training data for a resume-screening tool is overwhelmingly male, and you can’t correct that, then you shouldn’t be using the model to make automated decisions about hiring. Maybe you use it only to support human reviewers. Maybe you include a disclaimer in the documentation like “this model was trained on data with X bias and should not be used in scenario Y.” The key is to be transparent about the limitations, and to put boundaries around how the system is used. It’s also a point where data governance and model governance need to work together. If biased data slips through, model validation should catch it and trigger a rethink of how that data gets collected and cleaned next time.
One final note on synthetic data. It’s often pitched as a magic bullet for bias. And it can help especially if you need to generate examples for underrepresented groups. But you need to treat synthetic data with the same care as real data. Document how it was generated. Validate that it’s helping, not introducing new problems. Ideally, synthetic data should mimic the structure and distribution of real-world data but with specific improvements, like balancing group representation. But don’t assume it’s automatically safe. Use it carefully, and test whether the model trained on it still performs well in real-world scenarios.
Bias is hard. But ignoring it is worse. And with the right data governance practices in place, you can at least make sure your AI systems are learning from the right patterns, not just repeating the wrong ones.
In the next article, I’ll explore model governance in similar depth. I’ll share practical advice on versioning models, validating performance, monitoring drift, managing access, and ensuring ongoing fairness and safety once your models are in production.
Thank you for reading. I’d love to hear your thoughts on this article: what resonated, what you’d like more detail on, or any examples from your own experience. If you found this guide useful, please subscribe to stay updated on future instalments and practical tips for governing AI responsibly.
https://www.technologyreview.com/2018/10/10/139858/amazon-ditched-ai-recruitment-software-because-it-was-biased-against-women/
https://www.theregister.com/2020/10/05/excel_england_coronavirus_contact_error/
https://arxiv.org/pdf/2501.08962
https://www.theguardian.com/business/2013/sep/19/jp-morgan-london-whale-fine-blow
https://www.edpb.europa.eu/news/national-news/2020/hamburg-commissioner-fines-hm-353-million-euro-data-protection-violations_en
https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/
https://arxiv.org/abs/1803.09010
https://dvc.org/
https://delta.io/
https://artificialintelligenceact.eu/article/10/
https://www.reuters.com/legal/litigation/facebook-defends-725-million-privacy-settlement-us-appeals-court-2025-02-07
https://dl.acm.org/doi/10.1145/3546068