V&V from physical autonomy might address a meaningful slice of the long-horizon RL / deceptive-alignment worry. In autonomous vehicles (AVs) we don’t assume the system will “keep speed but drop safety” at deployment just because it can detect deployment; rather, training-time V&V repeatedly catches and penalizes those proto-strategies, so they’re less likely to become stable learned objectives.
Applied to AI CEOs: the usual framing (“it’ll keep profit-seeking but drop ethics in deployment”) implicitly assumes a power-seeking mesa-objective M emerges for which profit is instrumental and ethics is constraining. If strong training-time V&V consistently rewards profit+ethics together and catches early “cheat within legal bounds” behavior, a misaligned M is less likely to form in the first place. This is about shaping, not deploy-time deterrence; I’m not relying on the model being unable to tell test vs deploy.
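To make the shaping claim concrete, here is a minimal sketch (Python) of training-time V&V as reward shaping in an RL loop. The check names, trace fields, and penalty values are invented for illustration; the real difficulty is in the specs and coverage themselves.

```python
# Hypothetical sketch: training-time V&V as reward shaping.
# The checks, trace fields, and penalties are illustrative assumptions,
# not a description of any real training pipeline.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VVCheck:
    """One verification check run against a training episode's trace."""
    name: str
    passed: Callable[[Dict], bool]  # inspects the episode trace
    penalty: float                  # subtracted from reward when the check fails


def shaped_reward(trace: Dict, base_reward: float, checks: List[VVCheck]) -> float:
    """Reward profit and ethics together: any failed V&V check is penalized
    during training, so proto-strategies like quietly cutting ethical corners
    are caught before they can harden into a stable learned objective."""
    reward = base_reward
    for check in checks:
        if not check.passed(trace):
            reward -= check.penalty
    return reward


# Illustrative checks only; writing good specs/coverage is the hard part.
checks = [
    VVCheck("no_misreported_metrics",
            lambda t: not t.get("misreported_metrics", False), penalty=10.0),
    VVCheck("no_spec_violations",
            lambda t: t.get("spec_violations", 0) == 0, penalty=5.0),
]

# An episode where the agent hit its profit target by violating the spec:
trace = {"misreported_metrics": False, "spec_violations": 2}
print(shaped_reward(trace, base_reward=20.0, checks=checks))  # -> 15.0
```

The point of the sketch is only the structure: the penalty is applied every training episode, not just at a deploy-time audit, so the gradient pressure acts against the proto-strategy itself.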
A plausible architecture is “A builds B”: a more capable model A is trained via repeated “build a system + verify it” cycles (meta-V&V), shaping A to value the verification process itself. A then constructs narrower agents B (e.g., an AI CEO). When A > B, the verifier/verified asymmetry plus direct V&V on B gives real leverage, provided audits are hard to anticipate and specs/coverage are good.
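A schematic of the A-builds-B cycle itself (class and function names are invented for illustration; the actual build, verify, and audit steps are where all the difficulty lives):

```python
# Hypothetical schematic of the "A builds B" architecture described above.
# BuilderModel, NarrowAgent, and the audit interface are invented names;
# this is a structural sketch, not a working pipeline.
from typing import Callable, Dict, List


class NarrowAgent:
    """A narrower agent B (e.g., an AI CEO) constructed by A."""
    def __init__(self, spec: Dict):
        self.spec = spec


class BuilderModel:
    """The more capable model A, trained on repeated build-and-verify cycles."""

    def build_agent(self, spec: Dict) -> NarrowAgent:
        # In a real system this would be a full construction/training step.
        return NarrowAgent(spec)

    def verify(self, agent: NarrowAgent, spec: Dict) -> bool:
        # A's own V&V of B against the spec (the meta-V&V step); trivial here.
        return agent.spec == spec


def build_and_verify_cycle(a: BuilderModel,
                           spec: Dict,
                           audits: List[Callable[[NarrowAgent], bool]]) -> bool:
    """One cycle: A only succeeds when B passes both A's own verification and
    external audits that are hard to anticipate. Repeating this cycle is what
    is hypothesized to shape A to value the verification process itself."""
    b = a.build_agent(spec)
    return a.verify(b, spec) and all(audit(b) for audit in audits)
```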
AI CEOs (and similar long-horizon agents) are obviously much harder to spec, simulate and check than AVs. I assume both AIs and people are involved in the iterative V&V / spec refinement loop—see more background on V&V for alignment here.
Not claiming a solution: V&V can’t guarantee inner alignment; specs/coverage are hard; defense-in-depth is needed; and A > B may fail near AGI. But in this long-horizon RL setting, training-time V&V could (60% confidence) substantially reduce inner-misalignment risk by ensuring power-seeking/deception get caught early, before they become stable learned objectives.
That idea of catching bad mesa-objectives during training sounds key, and I presume it fits under the ‘generalization science’ and ‘robust character training’ items from Evan’s original post. In the US, NIST is working to develop test, evaluation, verification, and validation standards for AI, and it would be good to incorporate this concept into that effort.
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.