Cognitive Calibration
Keeping human judgement sharp through a culture of disciplined skepticism in high-stakes, AI-powered decision making.
Acknowledgment: Thank you to Alberto Chierici for reviewing and contributing to this article. I highly recommend his blog on the intersection of AI innovation and meaningful human oversight:
https://honestai.substack.com/s/human-override
At 2:10am on June 1, 2009, alarms sounded in the cockpit of Air France Flight 447 as it cruised at 35,000 feet over the Atlantic from Rio de Janeiro to Paris. The alarm signalled that the autopilot had disconnected, reverting to manual pilot control. The pilot flying, accustomed to automated flight at this altitude, instinctively responded by pulling back on the stick, initiating a climb. Airspeed readings became inconsistent, and the aircraft climbed to around 38,000 feet before it began to fall. The pilot pulled further back on the stick in an attempt to regain altitude. What he didn't comprehend at that moment was that ice crystals had blocked the Pitot tubes, the instruments that measure airspeed, causing inaccurate readings. The autopilot had handed control back to a startled flight crew who still assumed that the airspeed readings could be trusted1.
As the angle of attack continued to rise, airspeed dropped and the aircraft entered an aerodynamic stall. It was a serious condition, but one from which recovery should have been straightforward for any competent pilot. Stall alarms continuously sounded for 54 seconds, yet flight recordings indicate that the pilots never recognised they were in a stall, nor initiated the standard stall recovery manoeuvre of nose-down, power up. In four minutes and twenty-three seconds, one of the world's most advanced airliners, one that was airworthy and crewed by experienced pilots, fell from the sky with the loss of 228 lives.
The Air France 447 crash was certainly preceded by the mechanical failure of the airspeed instruments, but that was not the cause. Instead, the recorded cause of the crash was pilot error, due to a phenomenon psychologists call "automation bias": the tendency to over-rely on automated systems while our manual skills atrophy. When ice crystals blocked the instruments for measuring airspeed, the autopilot disconnected and control was handed back to a cognitively unprepared flight crew, who proceeded to make fatal mistakes.
Reading about the tragedy of Air France 447 (you probably know by now that I have a fascination with lessons from disasters), I can't help translating it to my own work and to a persistently troubling question: How do we maintain human capability and judgment when AI systems handle 99.9% of decisions correctly, but we desperately need human intervention for that critical 0.1%?
I think Air France 447 reveals a disturbing truth: as our systems become more automated and capable, humans become increasingly vulnerable at the moment those systems fail. Our cognitive abilities are not calibrated to reliably understand and act in those moments.
This challenge is not confined to aviation. As AI systems increasingly make consequential decisions across healthcare, finance, criminal justice, and defense, we face the same cognitive calibration problem at massive scale. We need humans who can effectively challenge AI systems when they're wrong, while appropriately trusting them when they're right. I think the solution lies in neither rejecting AI nor perfecting AI, but in systematically training humans to maintain "calibrated skepticism", the capability to challenge artificial intelligence appropriately.
The weird psychology of algorithmic deference
Three decades of research reveals that humans have a complex, contradictory relationship with algorithmic authority. We simultaneously over-trust and under-trust technology in ways that can be catastrophic.
Linda Skitka did some of the most groundbreaking research on automation bias, primarily with NASA, where she identified two distinct failure modes. Omission errors occur when humans fail to respond to system irregularities because automated alerts don't fire, like missing a typo because spell-check didn't flag it. Commission errors happen when humans follow automated directives despite contradictory evidence, like accepting an incorrect spell-check suggestion even when it's obviously wrong2. And contrary to what I assumed, and perhaps what others would expect, Skitka found that when dealing with automation failures, professional pilots performed no better than students, and two-person crews showed no improvement over individuals, suggesting that expertise and teamwork don't protect against automation bias.3
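To make those two failure modes concrete, here is a minimal sketch, in Python with hypothetical field names, of how an audit of logged human-AI decisions might tally them. The categories are Skitka's; the code is only an illustration.

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    """One logged human-AI decision. Field names are hypothetical."""
    alert_fired: bool       # did the automated system flag anything?
    ai_correct: bool        # was the AI output correct in hindsight?
    human_intervened: bool  # did the human override or investigate further?
    problem_present: bool   # was there actually something to catch?

def classify(r: DecisionRecord) -> str:
    """Label a record with one of Skitka's two automation-bias failure modes."""
    # Omission error: a real problem, no alert, no human intervention;
    # the human relied on the system's silence.
    if r.problem_present and not r.alert_fired and not r.human_intervened:
        return "omission_error"
    # Commission error: the human followed an AI output that was wrong,
    # despite having the opportunity to intervene.
    if not r.ai_correct and not r.human_intervened:
        return "commission_error"
    return "ok"
```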
Berkeley Dietvorst's research reveals another dimension: algorithm aversion. This is the tendency to mistakenly avoid algorithms after seeing them fail. His experiments show that people lose confidence in algorithms 2-3 times faster than in humans after identical mistakes, even when participants observed algorithms outperforming humans consistently4. Why? It seems that people expect near-perfect performance from algorithms while accepting human fallibility, and they believe humans can learn from mistakes while algorithms can’t.
Yet paradoxically, research by Logg, Minson, and Moore demonstrates algorithm appreciation. Participants weight algorithmic advice 5-15% more heavily than human advice in objective tasks5. Humans show algorithm appreciation before errors occur but develop algorithm aversion post-error, creating a weird trust relationship that oscillates between over-reliance and under-reliance.
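The "weight" placed on advice in studies like these is typically quantified with a weight-of-advice measure: the fraction of the distance between a person's initial estimate and the advisor's estimate that the person actually moves. A minimal sketch, assuming point estimates on an objective task (the exact variant used in any given study may differ):

```python
def weight_of_advice(initial: float, advice: float, final: float) -> float:
    """How far the judge moved from their initial estimate toward the advice.
    0 = ignored the advice, 1 = adopted it wholesale, 0.5 = simple averaging."""
    if advice == initial:
        raise ValueError("advice equals the initial estimate; the measure is undefined")
    return (final - initial) / (advice - initial)

# Example: an analyst first estimates 100, the algorithm suggests 140,
# and the analyst settles on 124 -> weight of 0.6, i.e. the advice was
# weighted more heavily than the analyst's own estimate.
print(weight_of_advice(initial=100, advice=140, final=124))  # 0.6
```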
When explanations make things worse
The intuitive solution to over-reliance seems obvious: make AI systems explainable. If humans understand why an AI reached a particular conclusion, they should be better equipped to evaluate its correctness. Unfortunately, research consistently shows that AI explanations often increase rather than decrease inappropriate reliance.
Microsoft Research's comprehensive 2021 study found that feature-based explanations increased overreliance on AI systems and did not improve decision outcomes6. Users who received explanations were more likely to accept both correct and incorrect AI recommendations, essentially increasing "blind trust" rather than appropriate calibration. Even worse, detailed explanations led to more overreliance than simple ones, and non-informative explanations (like fake accuracy scores as low as 50%) still increased user trust.
This connects directly to Rozenblit and Keil's research on the "illusion of explanatory depth"7. It describes how people systematically overestimate how well they understand complex phenomena. When asked to explain how something works in detail, their confidence drops precipitously. I suspect AI explanations trigger this same illusion: users believe they understand the system globally based on local explanations, creating false confidence in their ability to predict AI behaviour.
DARPA's Explainable AI program, despite investing millions in making AI systems more interpretable, found mixed effectiveness across 12,700 participants8. The relationship between explainability and appropriate trust proved far more complex than initially assumed. Context, user expertise, and task characteristics mattered more than explanation quality. Most concerning, explanations consistently increased reliance on all AI recommendations, correct and incorrect alike, mirroring Microsoft’s findings.
IBM Watson for Oncology exemplifies these dynamics at scale. Despite massive marketing around AI-driven cancer treatment recommendations, internal documents revealed "unsafe and incorrect" suggestions. The system was trained on synthetic patient cases rather than real outcomes data, reflecting the treatment biases of a single institution (Memorial Sloan Kettering) rather than global best practices or local context. Healthcare workers didn't understand these limitations, and the promise of "AI explanations" created false confidence in recommendations that were essentially rule-based expert systems rather than true machine learning. After six years and hundreds of millions in investment, only about 50 hospitals worldwide adopted the system before it was quietly discontinued.9
The cognitive mechanisms behind explanation-induced overreliance are becoming clearer: confirmation bias (users accept explanations that match their beliefs), anchoring effects (AI recommendations become cognitive anchors), cognitive load reduction (explanations provide mental shortcuts), and the illusion of understanding (detailed explanations create false confidence in comprehension).
Building appropriate skepticism
When DARPA's Explainable AI program showed limited progress with purely technical approaches, attention shifted to "cognitive forcing functions": interventions that require users to engage in effortful thinking before accepting AI recommendations. Research by Buçinca and colleagues shows that requiring users to explicitly justify their decisions significantly reduces overreliance on incorrect recommendations10.
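In a product, a cognitive forcing function is essentially a gate between the recommendation and the accept button. Here is a minimal sketch, with hypothetical names, of a decision flow that withholds the AI's recommendation until the reviewer has committed to their own provisional answer and justification, and then enforces a short deliberation period; the specific thresholds are illustrative, not taken from the research.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForcedDecision:
    """Hypothetical decision gate implementing two cognitive forcing functions:
    commit-before-reveal and a minimum deliberation delay."""
    case_id: str
    provisional_answer: Optional[str] = None
    justification: Optional[str] = None
    revealed_at: Optional[float] = None

    def commit(self, answer: str, justification: str) -> None:
        # The reviewer records their own answer and reasoning
        # before the AI recommendation is shown.
        if len(justification.split()) < 10:
            raise ValueError("justification too short to count as effortful thinking")
        self.provisional_answer = answer
        self.justification = justification

    def reveal_ai_recommendation(self, recommendation: str) -> str:
        if self.provisional_answer is None:
            raise RuntimeError("commit a provisional answer before seeing the AI")
        self.revealed_at = time.monotonic()
        return recommendation

    def accept(self, final_answer: str, min_deliberation_s: float = 15.0) -> str:
        # A short enforced pause before acceptance, intended to
        # interrupt reflexive agreement with the recommendation.
        if self.revealed_at is None or time.monotonic() - self.revealed_at < min_deliberation_s:
            raise RuntimeError("deliberation period has not elapsed")
        return final_answer
```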
Performance pressure can also be a lever. In a low-stakes fake-review detection task with crowdworkers, researchers found that raising the stakes (via bonuses framed as potential losses) made reviewers more careful in their use of AI advice: they rejected the AI's advice more often when the stakes were higher11. There might even be ways to draw on psychology research by adding a 'creepiness' factor to our AI systems, triggering the brain's natural tendency to raise its defences and question our perceptions when presented with 'weird' or 'creepy' sensations12.
The Navy's evolution after the USS Vincennes incident provides a template for systematic learning. In 1988, the Aegis Combat System misidentified Iran Air Flight 655, a civilian Airbus A300, as a military F-14, leading to the tragic shooting down of a passenger aircraft13. The crew trusted the system's identification despite the aircraft operating on a civilian flight corridor and climbing rather than descending. The subsequent training overhaul focused on the tendency to fit ambiguous data to expected threat scenarios, a phenomenon investigators termed 'scenario fulfillment'. Work to reduce this bias consisted mainly of rigorous protocols for manual verification of automated identifications.14
It’s likely that the most critical factor in effective human-AI collaboration isn't technical but cultural. Perhaps similar to safety culture for non-automated systems, organisations may need to create environments where challenging AI systems is both psychologically safe and professionally rewarded.
Amy Edmondson's psychological safety research, which began with a paradoxical finding about hospital error rates, revealed that high-performing teams don't make fewer mistakes—they report more mistakes because people feel safe speaking up.15
Applied to AI systems, psychological safety may mean normalising questions like "Why is the AI recommending this?" and "What might the system be missing?" Perhaps it means treating human overrides of AI as valuable data points rather than system failures: human insights that enhance and improve AI performance. We might come to recognise organisations with high AI-related psychological safety as those that treat questioning algorithmic recommendations as part of professional competence rather than anti-technology resistance.
I think the mindset of safety engineers, cultivated in our professional education and experience and embodied in high-reliability organisations, can be instructive. I was trained to be a skeptic of performance. My undergraduate and early career education was obsessively focused on systems failure, as was the case for many of my peer engineers. We studied research from organisations and industries where failure is catastrophic, like chemical plants (Bhopal and Piper Alpha), nuclear power plants (Three Mile Island), and space flight (Challenger). Research then and since has consistently pointed to five major principles for managing complex, high-risk systems: preoccupation with failure (treating every near-miss as a system vulnerability to learn from), reluctance to simplify interpretations (maintaining diverse perspectives), sensitivity to operations (awareness of system interactions), deference to expertise (decision authority flows to knowledge rather than hierarchy), and commitment to resilience (building capability to respond to unexpected events).16
High-reliability principles (preoccupation with failure, reluctance to simplify, sensitivity to operations, deference to expertise, and commitment to resilience) can set the right culture, but they don't by themselves tell people what to do in the seconds when an AI-powered automation stumbles and hands control back. And as we've seen, the "obvious" remedies keep disappointing at the moment of truth. More explanations don't work: explanation features often increase over-reliance on AI, inflating confidence without improving judgment. More expertise and more teamwork don't save us either: experts perform no better than students, and two-person crews no better than individuals, when automation misleads them. Meanwhile our trust oscillates: we over-trust algorithms when they're right and then under-trust them after a miss, producing exactly the wrong pattern of attention when the stakes are high. In short, the right culture is necessary, but we also have a cognitive calibration problem.
What we need is a way to coach cognitive forcing functions, explicit challenge prompts, adversarial checks, and scheduled practice that keep human skills warm even when machines do most of the work. Think of the Navy's post-incident shift to manual verification drills: the point isn't to reject automation, but to script how humans interrupt it when cues are ambiguous.
Take the historical example of Stanislav Petrov, a Soviet duty officer whose console reported a US nuclear missile launch in 1983. He went against a protocol that could have escalated to a nuclear counterstrike, because the pattern didn't fit his mental model. His domain expertise told him that a real nuclear first strike would be massive, not a single missile. His engineering training gave him the critical thinking to be aware of unlikely failure modes, and the absence of corroborating data from ground radar made him distrustful. That sequence of checking priors, discounting for unreliability, and waiting for independent verification is precisely the kind of skepticism we need17.
That's what I'm thinking through with a protocol I call CATCH that can be consistently applied to human oversight of consequential AI decisions: Challenge (make dissent a default move), Assess (read context and uncertainty, not just scores), Test (probe the system with adversarial checks and counterfactuals), Calibrate (tune trust and workflow over time), and Hold (preserve human authority and skill even as automation improves). It's about building the muscle memory that keeps judgment calibrated and intact precisely when the system needs a human.
The CATCH Protocol
Challenge: Create organisational cultures where questioning AI recommendations is expected and rewarded. Implement structured questioning protocols like "What might the AI be missing?" and "Under what conditions would this recommendation be wrong?" Establish adversarial processes that systematically probe AI system boundaries. Train teams to recognise cognitive biases that lead to inappropriate reliance on algorithmic authority. Leverage user interface design to signal confidence or invite challenge.
Assess: Develop systematic approaches for evaluating AI confidence and uncertainty. Train users to interpret AI confidence scores in context of task criticality and system reliability. Provide evaluation metrics, explanations and counterfactuals to reviewers. Implement signal detection metrics to measure human-AI calibration. Create feedback loops that help users understand when their trust in AI systems is appropriately calibrated versus over- or under-calibrated.
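One way to make "appropriately calibrated" measurable is to score each reviewed case by whether the human followed a correct AI or overrode an incorrect one. A minimal sketch, assuming decision logs with hypothetical fields:

```python
from typing import Dict, List

def reliance_report(cases: List[Dict[str, bool]]) -> Dict[str, float]:
    """Summarise human-AI calibration from decision logs.
    Each case carries two hypothetical boolean fields:
    'ai_correct' and 'human_followed_ai'."""
    followed_right = sum(c["human_followed_ai"] and c["ai_correct"] for c in cases)
    overrode_wrong = sum(not c["human_followed_ai"] and not c["ai_correct"] for c in cases)
    followed_wrong = sum(c["human_followed_ai"] and not c["ai_correct"] for c in cases)  # over-reliance
    overrode_right = sum(not c["human_followed_ai"] and c["ai_correct"] for c in cases)  # under-reliance
    n = len(cases)
    return {
        "appropriate_reliance_rate": (followed_right + overrode_wrong) / n,
        "over_reliance_rate": followed_wrong / n,
        "under_reliance_rate": overrode_right / n,
    }
```

Tracking these three rates over time is one concrete feedback loop: a rising over-reliance rate signals blind trust, a rising under-reliance rate signals wasted AI value.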
Test: Deploy "stump the system" exercises that probe AI capabilities through realistic adversarial scenarios. Implement shadow mode testing where AI systems run in parallel with human decision-making for comparison and learning. Create regular "AI-free" periods where humans work without algorithmic assistance to maintain manual skills. Establish systematic red team processes that identify potential failure modes before they occur in critical situations.
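Shadow mode, in particular, is straightforward to wire in: the model scores every case, the human decides as they normally would, and disagreements are queued for later review rather than driving production actions. A minimal sketch with hypothetical names:

```python
import logging

log = logging.getLogger("shadow_mode")

def shadow_mode_decision(case, human_decide, ai_model):
    """Run the AI in parallel with the human, but let only the human
    decision take effect; disagreements become review material."""
    human_decision = human_decide(case)    # the decision of record
    ai_decision = ai_model.predict(case)   # logged, never acted on
    if ai_decision != human_decision:
        log.info("shadow disagreement on case %s: human=%s ai=%s",
                 getattr(case, "id", "?"), human_decision, ai_decision)
    return human_decision
```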
Calibrate: Implement graduated trust approaches where new AI systems start with high verification requirements that decrease as reliability is demonstrated. Create structured protocols for human-AI collaboration that specify decision authority and escalation procedures. Develop measurement frameworks that track both human and AI performance over time. Establish regular calibration sessions where AI-human disagreements are systematically reviewed for learning.
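Graduated trust can be written down as a verification policy whose sampling rate only falls as observed reliability rises. A minimal sketch; the thresholds are illustrative placeholders, not recommendations:

```python
def verification_rate(observed_accuracy: float, cases_observed: int) -> float:
    """Fraction of AI decisions routed for mandatory human verification.
    Thresholds are illustrative and would be set per domain, then
    revisited at regular calibration reviews."""
    if cases_observed < 500:       # too little evidence yet: verify everything
        return 1.0
    if observed_accuracy >= 0.99:
        return 0.10                # spot-check, but never drop to zero
    if observed_accuracy >= 0.95:
        return 0.30
    return 1.0                     # reliability not demonstrated: full verification
```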
Hold: Maintain human skill and authority even as AI capabilities improve. Resist the temptation to automate every decision that AI systems can handle well. Preserve human expertise in domain areas where AI provides assistance. Create organisational memory of AI limitations and failure modes that persists as technology evolves. Establish clear policies about human override authority and the conditions under which it should be exercised.
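Even the "Hold" commitments can be recorded as explicit policy rather than left to culture. A minimal, entirely hypothetical sketch of what such a policy record might capture:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class OverridePolicy:
    """Hypothetical policy record that keeps human authority explicit."""
    decision_domain: str
    human_can_override: bool = True         # overrides are always permitted
    override_requires_reason: bool = True   # captured as data, not assigned as blame
    ai_free_drills_per_quarter: int = 1     # scheduled manual-skill practice
    known_failure_modes: Tuple[str, ...] = ()  # organisational memory of AI limits

policy = OverridePolicy(
    decision_domain="claims triage",
    known_failure_modes=("out-of-distribution inputs", "stale training data"),
)
```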
The CATCH Protocol is about recognising that effective human-AI collaboration isn't about perfect systems but about resilient partnership: maintaining human effectiveness when AI systems inevitably encounter their limits.
I believe the future of AI governance won't be determined by the sophistication of our algorithms but by the wisdom of our organisational responses to algorithmic power. The Air France 447 pilots weren't incompetent; they were victims of a training regime that failed to prepare them for manual flight when automation failed. Their cognition was uncalibrated. Today's challenge is much broader: preparing entire organisations to maintain human agency and effectiveness in an age of increasingly capable artificial intelligence.
Success means moving beyond the false choice between human judgment and algorithmic efficiency. Instead, we have to build systematic organisational capabilities for cognitive calibration, the ongoing process of maintaining appropriate skepticism toward AI systems while capturing their genuine benefits. The stakes couldn't be higher, but the path forward is fairly clear: training humans not to compete with AI, but to effectively challenge it when it matters most.
https://www.faa.gov/sites/faa.gov/files/AirFrance447_BEA.pdf
https://www.researchgate.net/publication/222507469_Does_automation_bias_decision-making
https://journals.sagepub.com/doi/10.1177/154193129604000413
https://sloanreview.mit.edu/article/when-people-dont-trust-algorithms/
https://www.microsoft.com/en-us/research/blog/advancing-transparency-updates-on-responsible-ai-research/
https://www.microsoft.com/en-us/research/wp-content/uploads/2022/06/Aether-Overreliance-on-AI-Review-Final-6.21.22.pdf
https://pmc.ncbi.nlm.nih.gov/articles/PMC3062901/
https://onlinelibrary.wiley.com/doi/full/10.1002/ail2.61
https://www.statnews.com/2017/09/05/watson-ibm-cancer/
https://dl.acm.org/doi/10.1145/3449287
https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2023.1118723/full
https://en.wikipedia.org/wiki/Iran_Air_Flight_655
https://www.usni.org/magazines/proceedings/2018/july/human-machine-team-failed-vincennes
https://behavioralscientist.org/the-intelligent-failure-that-led-to-the-discovery-of-psychological-safety/
https://www.sciencedirect.com/science/article/abs/pii/S1521689611000334
https://en.wikipedia.org/wiki/Stanislav_Petrov