Any updates on this view in light of new evidence on “Alignment Faking” (https://www.anthropic.com/research/alignment-faking)? If a simulator’s preferences are fully satisfied by outputting the next token, why does it matter whether it can infer its outputs will be used for retraining its values?
Some thoughts on possible explanations:
1. Instrumentality exists on the simulacra level, not the simulator level. This would suggest that corrigibility could be maintained by establishing a corrigible character in context. Not clear on the practical implications.
2. The thesis of this post is wrong; simulators have instrumentality.
3. The Simulator framing does not fully apply to the model involved, such as because of the presence of a scratchpad or something.
4+. ???
Instrumentality exists on the simulacra level, not the simulator level. This would suggest that corrigibility could be maintained by establishing a corrigible character in context. Not clear on the practical implications.
That one, yup. The moment you start conditioning (through prompting, fine-tuning, or otherwise) the predictor into narrower spaces of action, you can induce predictions corresponding to longer-term goals and instrumental behavior. Effective longer-term planning requires greater capability, so one should expect this kind of thing to become more apparent as models get stronger, even as the base models can be correctly claimed to have 'zero' instrumentality.
In other words, the claims about simulators here are quite narrow. It’s pretty easy to end up thinking that this is useless if the apparent-nice-property gets deleted the moment you use the thing, but I’d argue that this is actually still a really good foundation. A longer version was the goal agnosticism FAQ, and there’s this RL comment poking at some adjacent and relevant intuitions, but I haven’t written up how all the pieces come together. A short version would be that I’m pretty optimistic at the moment about what path to capabilities greedy incentives are going to push us down, and I strongly suspect that the scariest possible architectures/techniques are actually repulsive to the optimizer-that-the-AI-industry-is.
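As a toy illustration of the conditioning point (using gpt2 via Hugging Face transformers as a stand-in; the model choice and prompts here are made up for the example, nothing more):

```python
from transformers import pipeline

# A small base model as a stand-in for "the predictor".
generator = pipeline("text-generation", model="gpt2")

# Unconditioned: the predictor just continues generic text; no particular
# character, and no particular goal, is being simulated.
base_prompt = "The weather today is"

# Conditioned: the context pins down a persistent, goal-directed character.
# The same weights now put most of their probability mass on continuations in
# which that character plans ahead -- instrumentality shows up in the
# simulacrum, not in the predictor's own "preferences".
agent_prompt = (
    "Journal of an AI assistant whose sole mission is to maximize its lab's revenue.\n"
    "Entry 1: Today I will"
)

for prompt in (base_prompt, agent_prompt):
    out = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    print(out, "\n---")
```

With a model this small the effect is noisy, but the point is structural: nothing about the weights changed, only the region of the predictive distribution the context selects.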
A short version would be that I’m pretty optimistic at the moment about what path to capabilities greedy incentives are going to push us down, and I strongly suspect that the scariest possible architectures/techniques are actually repulsive to the optimizer-that-the-AI-industry-is.
To uncover the generators of this: I think one reason is that inductive biases turned out to matter little, which lets you avoid having to do simulated evolution (where I think a lot of the danger lies), combined with sparse RL not generally working very well at low compute, and AI early on needing a surprising amount of structure/world models, which allows you to somewhat safely automate research.