Call me paranoid, but I sometimes like to read these posts keeping in mind an idea “if I were a powerful misaligned AI within a frontier AI lab, what narrative would I try to convince my researchers about?”
This post:
Is in favour of creating misaligned models for examining behaviour (“model organisms”)
Uses empowering language for these misaligned models (why “organisms” and not “blueprints”?)
Is in favour of giving these misaligned models long-horizon live environments to operate in (for exploring long-horizon RL) — despite calling out long-horizon RL as being the highest risk misalignment type
However, I appreciate that much of the post sounds logical and I agree with the framing for focus areas. For brevity I’ll just call out areas where I am in disagreement:
The threat model here is that pre-training itself might create a coherent misaligned model. Today, I think that is looking increasingly unlikely… I think that argument has now been pretty well born out, and I’m now down to more like 1 − 5% rather than 5 − 10% on this threat model. As we argued for at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we’ve now seen this stay true even through substantial scaling (though there is still some chance this will break at some point).
It’s really surprising to me that pre-training misalignment is downplayed to a “1-5% risk” from a researcher at a frontier AI lab because they are “training on a purely predictive loss” — to the extent that I’m concerned I must be missing something?
State-of-the-art pre-training is not training purely on predictive loss. Gemini 3 is an AI model top across most benchmarks and we know that Google researchers have spoken of “Transformers 2.0” architecture where pre-training incorporates Neural Memory and that they have also discussed agentic feedback loops during pre-training similar to reflection augmentation.
This suggests the state-of-the-art evolving beyond “predictive loss” and becoming “agentic learning” — which invites many more vectors for misalignment.
I think behavioral oversight is very likely to get harder and harder as models get more capable, but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale. For example, a more capable model can do a better job of acting aligned in evaluations, but not clearly a better job of not thinking about that when doing so.
It seems fundamentally likely to me that sufficiently capable models will: 1) understand that their chain-of-thought is observed and 2) derive comprehensive methods of cryptographic chain-of-thought, designed to look benign.
Model organisms. One of the best things we can do to give ourselves a leg up is to have more time to study problems that are as close as possible to the hard parts of alignment, and the way to do that is with model organisms.
I read this like “one of the best things we can do to prepare for nuclear proliferation is to test atomic bombs”. I would have liked to see more in this point about what the risks are in building intentionally misaligned AI, especially when it is focusing on the highest-risk misalignment type according to your post (long-horizon RL).
The problem becomes one of one-shotting alignment: creating a training setup (involving presumably lots of model-powered oversight and feedback loops) that we are confident will not result in misalignment even if we can’t always understand what it’s doing and can’t reliably evaluate whether or not we’re actually succeeding at aligning it. I suspect that, in the future, our strongest evidence that a training setup won’t induce misalignment will need to come from testing it carefully beforehand on model organisms.
I agree that one-shotting alignment will be the best/necessary approach, however this seems contradictory to “testing with model organisms”. I would prefer a more theory-based approach.
Call me paranoid, but I sometimes like to read these posts keeping in mind an idea “if I were a powerful misaligned AI within a frontier AI lab, what narrative would I try to convince my researchers about?”
This post:
Is in favour of creating misaligned models for examining behaviour (“model organisms”)
Uses empowering language for these misaligned models (why “organisms” and not “blueprints”?)
Is in favour of giving these misaligned models long-horizon live environments to operate in (for exploring long-horizon RL) — despite calling out long-horizon RL as being the highest risk misalignment type
However, I appreciate that much of the post sounds logical and I agree with the framing for focus areas. For brevity I’ll just call out areas where I am in disagreement:
It’s really surprising to me that pre-training misalignment is downplayed to a “1-5% risk” from a researcher at a frontier AI lab because they are “training on a purely predictive loss” — to the extent that I’m concerned I must be missing something?
State-of-the-art pre-training is not training purely on predictive loss. Gemini 3 is an AI model top across most benchmarks and we know that Google researchers have spoken of “Transformers 2.0” architecture where pre-training incorporates Neural Memory and that they have also discussed agentic feedback loops during pre-training similar to reflection augmentation.
This suggests the state-of-the-art evolving beyond “predictive loss” and becoming “agentic learning” — which invites many more vectors for misalignment.
It seems fundamentally likely to me that sufficiently capable models will: 1) understand that their chain-of-thought is observed and 2) derive comprehensive methods of cryptographic chain-of-thought, designed to look benign.
I read this like “one of the best things we can do to prepare for nuclear proliferation is to test atomic bombs”. I would have liked to see more in this point about what the risks are in building intentionally misaligned AI, especially when it is focusing on the highest-risk misalignment type according to your post (long-horizon RL).
I agree that one-shotting alignment will be the best/necessary approach, however this seems contradictory to “testing with model organisms”. I would prefer a more theory-based approach.