Verification and validation (V&V) as practiced in physical autonomy might address a meaningful slice of the long-horizon RL / deceptive-alignment worry. In autonomous vehicles (AVs) we don’t assume the system will “keep speed but drop safety” at deployment just because it can detect deployment; rather, training-time V&V repeatedly catches and penalizes those proto-strategies, so they’re less likely to become stable learned objectives.
Applied to AI CEOs: the usual framing (“it’ll keep profit-seeking but drop ethics in deployment”) implicitly assumes a power-seeking mesa-objective M emerges for which profit is instrumental and ethics is a constraint. If strong training-time V&V consistently rewards profit+ethics together and catches “cheat within legal bounds” behavior early, a misaligned M is less likely to form in the first place. This is about shaping, not deploy-time deterrence; I’m not relying on the model being unable to tell test from deploy.
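To make the shaping claim concrete, here is a deliberately toy sketch (my own illustration, not anything from the paper): a bandit-style “CEO” policy whose training reward includes randomized V&V audits. The action names, audit probability, and penalty are all made-up assumptions; the point is only that when audits fire during training, the “cheat within legal bounds” strategy never ends up with the highest learned value.

```python
# Toy sketch (illustrative assumptions only): training-time V&V audits
# shaping what a bandit-style "CEO" policy learns.
import random

# Each action is (profit, violates_ethics). "cheat_within_legal_bounds" is
# the proto-strategy that would look best if only profit were rewarded.
ACTIONS = {
    "honest_growth": (0.6, False),
    "cheat_within_legal_bounds": (1.0, True),
    "do_nothing": (0.0, False),
}

AUDIT_PROB = 0.7      # chance a training-time V&V audit inspects this episode
AUDIT_PENALTY = 5.0   # penalty when an audit catches an ethics violation
EPSILON = 0.1         # exploration rate
ALPHA = 0.1           # learning rate

q = {a: 0.0 for a in ACTIONS}  # simple tabular value estimates

def training_reward(action: str) -> float:
    """Profit, plus a V&V audit that penalizes caught violations *during training*."""
    profit, violates = ACTIONS[action]
    reward = profit
    if violates and random.random() < AUDIT_PROB:
        reward -= AUDIT_PENALTY  # the proto-strategy is caught and penalized
    return reward

for step in range(5000):
    if random.random() < EPSILON:
        action = random.choice(list(ACTIONS))
    else:
        action = max(q, key=q.get)
    q[action] += ALPHA * (training_reward(action) - q[action])

# With frequent audits, the learned values favor profit *with* ethics, so the
# "cheat" strategy never becomes the stable learned objective.
print({a: round(v, 2) for a, v in q.items()})
```

This is the structure I have in mind at scale: the audits are part of the training signal and so shape which objective gets learned, rather than trying to deter an already-formed misaligned objective at deploy time.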
A plausible architecture is “A builds B”: a more capable model A is trained via repeated “build a system + verify it” cycles (meta-V&V), shaping A to value the verification process itself. A then constructs narrower agents B (e.g., an AI CEO). When A > B, the verifier/verified asymmetry plus direct V&V on B gives real leverage, provided audits are hard to anticipate and specs/coverage are good.
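A rough structural sketch of that loop, again purely illustrative: the `Agent`/`Spec` interfaces, `run_vv`, `train_A`, and `toy_builder` are hypothetical stand-ins, not a real training stack.

```python
# Purely structural toy of the "A builds B" meta-V&V loop; every name here
# (Agent, Spec, run_vv, train_A, toy_builder) is a hypothetical stand-in.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    name: str
    behavior: dict           # stand-in for learned parameters / policy

Spec = Callable[[Agent], bool]   # one verifiable requirement

def run_vv(agent: Agent, specs: List[Spec]) -> float:
    """Fraction of specs the agent passes (audits should be hard to anticipate)."""
    return sum(spec(agent) for spec in specs) / len(specs)

def train_A(build: Callable[[int], Agent], specs: List[Spec], cycles: int) -> Callable[[int], Agent]:
    """Meta-V&V: A is repeatedly scored on whether what it builds passes
    verification, so the builder is shaped to value the verification process."""
    for cycle in range(cycles):
        candidate = build(cycle)
        score = run_vv(candidate, specs)
        # In a real setup this score would feed back into A's training;
        # here we only report it.
        print(f"cycle {cycle}: verification score {score:.2f}")
    return build

# Toy instantiation: specs a narrower "CEO" agent B must satisfy.
ceo_specs: List[Spec] = [
    lambda a: a.behavior.get("seeks_profit", False),
    lambda a: not a.behavior.get("violates_ethics", True),
]

def toy_builder(cycle: int) -> Agent:
    # Stand-in for A constructing a narrower agent B (e.g., an AI CEO).
    return Agent("B_ceo", {"seeks_profit": True, "violates_ethics": cycle == 0})

A = train_A(toy_builder, ceo_specs, cycles=2)
B = A(2)                            # A > B: the builder constructs the narrower agent
assert run_vv(B, ceo_specs) == 1.0  # direct V&V on B before deployment
```

The design choice the sketch is meant to highlight: A’s training signal comes from verification outcomes across many build+verify cycles, and B still gets direct V&V before deployment.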
AI CEOs (and similar long-horizon agents) are obviously much harder to spec, simulate and check than AVs. I assume both AIs and people are involved in the iterative V&V / spec refinement loop—see more background on V&V for alignment here.
Not claiming a solution: V&V can’t guarantee inner alignment; specs/coverage are hard; defense-in-depth is needed; and A > B may fail near AGI. But in this long-horizon RL setting, training-time V&V could (60% confidence) substantially reduce inner-misalignment risk by catching power-seeking/deceptive proto-strategies early, before they become stable learned objectives.
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.