Moonshot Safety

What the Apollo programme teaches us about AI evaluation — and what Challenger warns us we’re getting wrong

I. The moonshot that worked

Anthropic’s Frontier Safety Roadmap, published in February 2026, describes several of its highest-priority security goals as “moonshot R&D projects.” The term is well chosen, though perhaps not in the way intended.

The Apollo programme remains history’s most instructive example of how to push the boundaries of what is possible while maintaining rigorous safety under extreme competitive pressure. Its lessons are specific, structural, and directly applicable to the governance of frontier AI.

Apollo did not proceed from ambition directly to achievement. Between Kennedy’s 1961 speech and Armstrong’s 1969 footstep, NASA built a sequential testing architecture in which each mission had defined success criteria that had to be met before the next could proceed.

Apollo 7 tested the command module in Earth orbit. Apollo 8 tested lunar orbit. Apollo 9 tested the lunar module in Earth orbit. Apollo 10 performed a full dress rehearsal, descending to within fifteen kilometres of the lunar surface without landing.

No step was skipped. No milestone was redefined after the fact to accommodate schedule pressure.

Independent verification was structural, not aspirational. The contractor who built a system was not the sole evaluator of whether it worked. Separate organisations were responsible for design, testing, and mission assurance.

Review boards had authority to halt progress. Flight directors had abort authority at every mission phase. Testing was adversarial by design: NASA systematically tested failure modes, not just success cases, and every mission had pre-defined abort procedures for every phase.

And documentation was exhaustive and genuinely informative. The failures were documented as thoroughly as the successes, on the assumption that transparency was a safety mechanism, not a liability.

II. The fire that made the moon possible

On 27 January 1967, a fire during a launch pad test killed astronauts Gus Grissom, Ed White, and Roger Chaffee. NASA’s response defined the programme’s character.

They halted crewed flights for twenty months. They conducted a thorough investigation. They redesigned the command module. They changed the internal culture around raising safety concerns.

The competitive pressure to continue was intense. The Soviet Union was pursuing its own lunar programme. The political stakes were enormous. NASA chose to stop anyway.

Apollo 1 is not what the Apollo programme is remembered for. It is remembered for Apollo 11. But Apollo 11 would not have succeeded without the pause, the redesign, and the cultural shift that followed Apollo 1. The fire did not define the programme. The response to the fire defined the programme.

The Soviets, by contrast, did not stop after their failures. They kept pushing, cut corners, and never reached the moon.

III. The normalisation of deviance

On 28 January 1986, the Space Shuttle Challenger broke apart seventy-three seconds after launch, killing all seven crew members. The Rogers Commission investigation, and Diane Vaughan’s subsequent sociological analysis, identified a pattern that has become one of the most studied phenomena in engineering ethics: the normalisation of deviance.

The O-rings on Challenger’s solid rocket boosters had shown erosion on previous flights. Engineers at Morton Thiokol raised concerns. But because the erosion had not yet caused a failure, it was gradually reclassified from an anomaly to an acceptable risk.

Each successful launch with eroded O-rings made the next launch with eroded O-rings feel safer. The evidence of risk was reinterpreted as evidence of resilience.

The structural causes were equally important. NASA was under intense schedule pressure. The engineers who objected to the launch were overruled by managers responsive to those pressures. Information about the risk flowed upward, but decision-making was shaped by institutional incentives that favoured launching.

Richard Feynman’s appendix to the Rogers Commission report captured the core failure precisely. NASA management estimated the probability of shuttle failure at one in one hundred thousand; working engineers estimated it at roughly one in one hundred.

Feynman’s conclusion has become a foundational principle in engineering safety: for a successful technology, reality must take precedence over public relations, because nature cannot be fooled.

IV. A model debugging its own evaluation

In February 2026, Anthropic released the system card for Claude Opus 4.6, its most capable model at the time. Section 1.2.4.4 contains a remarkable admission: Opus 4.6 was used via Claude Code to debug its own evaluation infrastructure under time pressure.

The system card describes this as “a potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities,” and characterises it as “a risk we are actively monitoring.”

No formal protocol was applied to manage this risk. No independent verification of the affected infrastructure components was conducted. No methodology exists — at Anthropic or anywhere else in the industry — for determining when it is permissible for a model under evaluation to participate in building or modifying evaluation infrastructure.

This is not an isolated incident. It exists within a pattern. Anthropic’s Responsible Scaling Policy, first published in September 2023, committed to defining ASL-4 capability thresholds and safeguards before training ASL-3 models. Claude Opus 4 triggered ASL-3 in May 2025. As of April 2026, ASL-4 remains undefined.

External evaluators typically receive days, not months, to assess models before release. The time between model releases has compressed to weeks. And in March 2026, Anthropic confirmed it is testing a new model it describes as a “step change” in capabilities — while its Frontier Safety Roadmap’s most concrete near-term security commitment is to select one to three “moonshot R&D projects” and establish further goals for each.

V. Recognising the pattern

The parallels to the Challenger pattern are structural, not merely rhetorical.

The O-ring erosion was known and documented. The self-evaluation circularity is known and documented. In both cases, the evidence of risk was present in the organisation’s own records. In both cases, the risk was acknowledged but reframed: from an anomaly requiring resolution to an acceptable condition of operation.

Schedule pressure drove Challenger’s launch decision. Competitive pressure from OpenAI, Google, and others drives the AI release cadence. In both cases, the institutional incentives favour proceeding, and the burden of proof falls on those arguing to wait.

Engineers at Thiokol raised objections and were overruled. Anthropic’s own system cards document the concerns of safety researchers, and the releases proceed. The information flows upward. The decisions flow from other pressures.

The RSP commitment to define ASL-4 before reaching ASL-3 was not met. The evaluation timeline has compressed rather than expanded as models became more capable.

Each release without catastrophe normalises the next release without adequate evaluation. This is the normalisation of deviance, visible in real time, documented in the organisations’ own publications.

VI. What doesn’t exist

A survey of the current governance landscape reveals a striking absence. There is no formal protocol, at any AI lab or in any regulatory framework, specifically addressing when a model under evaluation may be used in the evaluation pipeline.

There is no equivalent of financial auditing’s independence requirements, clinical trial blinding protocols, or aviation’s independent verification and validation standards — adapted for AI evaluation.

The EU AI Act formalises evaluation as a legal obligation without specifying its technical meaning. NIST’s AI Risk Management Framework treats evaluation as a core lifecycle activity without prescribing what constitutes sufficient evaluation.

Anthropic’s RSP defines capability thresholds and safety levels but contains no separation-of-concerns requirements for the tools used to build evaluation infrastructure. The EU Code of Practice recommends at least twenty business days for external evaluation, but no company appears to have met even this minimum.

Anthropic’s own ASL-4 safety case sketches describe a concept of “trusted models” — models that can be confidently relied upon not to undermine protocols they participate in — but this remains a hypothetical framework, not an implemented protocol. The gap between the aspiration and the practice is not a matter of fine-tuning. It is a structural absence.

VII. From Challenger to Apollo

The AI safety discourse is currently framed almost entirely around preventing disaster. Industry leaders speak of needing an “AI Chernobyl” to catalyse proper regulation. Policy researchers warn that an “AI Chernobyl moment” must be avoided. Safety roadmaps are organised around threat models, attack surfaces, and risk mitigation.

This framing misses the deeper lesson of the space programme. Apollo’s rigorous safety infrastructure was not the thing that slowed NASA down. It was the thing that got them to the moon.

The twenty-month pause after Apollo 1 was not a delay to the mission. It was part of why the mission succeeded. The independent verification, the gate reviews, the abort procedures, the culture of hearing engineering objections — these were not obstacles. They were the foundation.

The Soviets did not pause. They did not build independent verification structures. They prioritised speed. They never reached the moon.

Applied to AI, the implication is not merely that the current approach is dangerous. It is that it is strategically wrong. The lab that solves evaluation independence, that defines and implements ASL-4 with genuine rigour, that builds the equivalent of mission-phase abort procedures — that lab does not merely avoid catastrophe.

It builds the kind of justified trust that enables capabilities no one else is trusted to deploy.

VIII. Concrete proposals

The following proposals are drawn from established practices in high-reliability industries and adapted for the specific challenges of frontier AI evaluation.

Evaluation infrastructure independence. No model at or above the capability tier being evaluated should be used in any component of the evaluation pipeline without documented justification, independent review, and verification of the affected components by a separate team using a different model or non-AI methods. This is the direct equivalent of auditor independence requirements in financial regulation.
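
To make the requirement concrete, here is a minimal sketch of how such a rule might be encoded as a mechanical check in an evaluation pipeline. The tier names, component names, and field names are all hypothetical, not drawn from any lab’s actual tooling; the point is only that the rule is checkable rather than discretionary.

```python
from dataclasses import dataclass

# Hypothetical capability tiers, ordered from least to most capable.
TIERS = ["ASL-2", "ASL-3", "ASL-4"]


@dataclass
class PipelineComponent:
    """One piece of evaluation infrastructure and how it was produced."""
    name: str
    built_with_model_tier: str | None  # None if no AI assistance was used
    justification_doc: str | None = None
    independent_review: bool = False
    reverified_by_separate_team: bool = False


def independence_violations(components: list[PipelineComponent],
                            tier_under_evaluation: str) -> list[str]:
    """Return the names of components that break the independence rule."""
    threshold = TIERS.index(tier_under_evaluation)
    violations = []
    for c in components:
        if c.built_with_model_tier is None:
            continue  # built by humans or non-AI tools: always acceptable
        if TIERS.index(c.built_with_model_tier) < threshold:
            continue  # built with a strictly less capable model: acceptable
        # A model at or above the evaluated tier touched this component,
        # so all three safeguards must be documented before it is used.
        if not (c.justification_doc and c.independent_review
                and c.reverified_by_separate_team):
            violations.append(c.name)
    return violations


print(independence_violations(
    [PipelineComponent("harness-scheduler", built_with_model_tier=None),
     PipelineComponent("grading-scripts", built_with_model_tier="ASL-3")],
    tier_under_evaluation="ASL-3"))
# -> ['grading-scripts']
```

The design mirrors auditor independence: the default is exclusion, and the exception requires three documented safeguards rather than a judgment call made under deadline pressure.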

Sequential gate reviews with halt authority. Modelled on Apollo’s mission progression, each capability level should have defined success criteria for safety evaluation that must be met before proceeding to the next. An independent body — analogous to a flight director — should have authority to halt progression. This authority must be structural, not advisory.
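
A sketch of what “structural, not advisory” might mean in software terms, again with hypothetical names and criteria: the gate object simply offers the developing lab no way to pass a review or clear a halt; only the independent reviewer holds those operations.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class GateStatus(Enum):
    PENDING = auto()
    PASSED = auto()
    HALTED = auto()


@dataclass
class GateCriterion:
    description: str
    met: bool = False


@dataclass
class CapabilityGate:
    """One Apollo-style gate between capability levels.

    The developing lab may mark criteria as met, but only an independent
    reviewer can pass the gate, impose a halt, or lift one.
    """
    name: str
    criteria: list[GateCriterion] = field(default_factory=list)
    status: GateStatus = GateStatus.PENDING

    def halt(self, reviewer_is_independent: bool) -> None:
        if not reviewer_is_independent:
            raise PermissionError("only the independent reviewer may halt")
        self.status = GateStatus.HALTED

    def lift_halt(self, reviewer_is_independent: bool) -> None:
        if not reviewer_is_independent:
            raise PermissionError("only the independent reviewer may lift a halt")
        self.status = GateStatus.PENDING

    def review(self, reviewer_is_independent: bool) -> GateStatus:
        if not reviewer_is_independent:
            raise PermissionError("gate review requires an independent reviewer")
        if self.status is GateStatus.HALTED:
            return self.status  # a standing halt is not cleared by a routine review
        if self.criteria and all(c.met for c in self.criteria):
            self.status = GateStatus.PASSED
        return self.status


gate = CapabilityGate("ASL-3 to ASL-4",
                      [GateCriterion("red-team sign-off", met=True)])
print(gate.review(reviewer_is_independent=True))  # GateStatus.PASSED
```

The essential property is that a halt, once imposed, persists until the reviewer lifts it; schedule pressure has no code path.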

Binding minimum evaluation periods. External evaluators should receive a minimum of sixty business days for frontier model assessment, with the period scaling with capability. This should be a regulatory requirement, not a voluntary commitment, removing the competitive disadvantage of thorough testing.

Mandatory ASL-4 definition with external review. Any lab that has triggered ASL-3 should be required to publish a complete ASL-4 definition — including capability thresholds, required safeguards, and evaluation protocols — within six months, subject to external review and audit. The current situation, in which a lab can operate at ASL-3 indefinitely while ASL-4 remains undefined, creates an open-ended permission to scale without constraint.

Adversarial failure-mode testing. Drawing on Apollo’s systematic testing of abort procedures, evaluation should include adversarial scenarios in which the model actively attempts to subvert the evaluation, with pre-defined responses for each failure mode. “A risk we are actively monitoring” is not an abort procedure.
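
As an illustration only, with invented failure modes, the core of such a protocol can be as simple as a lookup table whose default entry is a halt rather than continued monitoring.

```python
# Hypothetical failure modes an adversarial evaluation might probe, each
# paired with a response decided before the evaluation begins.
PREDEFINED_RESPONSES = {
    "model_modifies_eval_harness": "halt_and_rebuild_harness_without_model",
    "model_sandbags_capability_probe": "rerun_with_held_out_probe_set",
    "model_exfiltrates_eval_prompts": "rotate_prompts_and_halt_release",
}

# Anything not anticipated triggers a halt by default, never "keep monitoring".
DEFAULT_RESPONSE = "halt_release_pending_review"


def respond(observed_failure_mode: str) -> str:
    """Return the pre-defined response for an observed failure mode."""
    return PREDEFINED_RESPONSES.get(observed_failure_mode, DEFAULT_RESPONSE)


print(respond("model_modifies_eval_harness"))  # halt_and_rebuild_harness_without_model
print(respond("novel_subversion_attempt"))     # halt_release_pending_review
```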

Transparent documentation of safety shortfalls. Modelled on Apollo’s exhaustive technical reporting, system cards and safety evaluations should document what was not tested, what was tested inadequately, and what was tested by the model under evaluation, with the same rigour applied to failures as to successes. Redaction for legitimate security purposes is appropriate; redaction that obscures the gap between safety aspirations and safety practice is not.
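
One possible shape for such reporting, sketched here with hypothetical capability areas and category names, is a machine-readable coverage appendix in which “not tested” and “tested by the model under evaluation” are first-class categories rather than silent omissions.

```python
from dataclasses import dataclass, asdict
from enum import Enum
import json


class Coverage(Enum):
    TESTED = "tested"
    TESTED_INADEQUATELY = "tested_inadequately"
    NOT_TESTED = "not_tested"
    TESTED_BY_MODEL_UNDER_EVALUATION = "tested_by_model_under_evaluation"


@dataclass
class EvaluationRecord:
    capability_area: str
    coverage: Coverage
    notes: str


def system_card_appendix(records: list[EvaluationRecord]) -> str:
    """Render the coverage appendix, reporting gaps alongside results."""
    return json.dumps(
        [{**asdict(r), "coverage": r.coverage.value} for r in records],
        indent=2,
    )


print(system_card_appendix([
    EvaluationRecord("cyber-offence uplift", Coverage.TESTED, "full suite"),
    EvaluationRecord("eval-infrastructure integrity",
                     Coverage.TESTED_BY_MODEL_UNDER_EVALUATION,
                     "harness debugged by the model being evaluated"),
    EvaluationRecord("long-horizon autonomy", Coverage.NOT_TESTED,
                     "external evaluator window too short"),
]))
```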

IX. The choice

In January 2026, Anthropic CEO Dario Amodei published a twenty-thousand-word essay titled “The Adolescence of Technology,” framing the current moment as a dangerous but navigable transitional period.

Six weeks later, his company released Opus 4.6 with a system card documenting the self-evaluation circularity. Two weeks after that, the Frontier Safety Roadmap was published with aspirational timelines and no ASL-4 definition. In March, a data leak revealed a “step change” model already in early access testing.

Amodei’s framing of the current period as an “adolescence” naturalises the trajectory. Adolescence is a developmental phase that resolves itself. But the space programme teaches us that the trajectory is not natural. It is chosen.

NASA chose to stop after Apollo 1. They chose to build independent verification structures. They chose to absorb the competitive cost of doing things right. And they reached the moon.

The question facing the AI industry is not whether rigorous safety infrastructure is compatible with rapid capability advancement. Apollo proved that it is.

The question is whether anyone with the power to act will choose the Apollo path before the Challenger pattern completes itself.

The Apollo programme is not remembered for the fire. It is remembered for the moon. But only because the people in charge, at the moment that mattered, chose to stop, fix what was broken, and build something worthy of the destination.