This is very interesting! A few thoughts/questions:
I didn’t quite follow the argument that H_{fh} beats H_{sd} on complexity. Is it that pointing to the base objective is more complicated than the logic of (simple mesa-objective) + (search logic to long-run optimize the mesa-objective)? If so, I worry a little that H_{sd} still has to learn a pointer to the base objective, if only so that it can perform well on it during training.
I actually think you can define a speed prior with a single long training episode. For an agent that plays chess, the prior can be over thinking time per move; for an agent that runs in a simulated environment, it could be ‘thinking time per unit simulation time’; for GPT, ‘thinking time per predicted word’; and so on.
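To make the per-move version concrete, here is a minimal sketch (all names hypothetical, not from the post) of how a speed prior could be applied within one long episode: each decision accrues a penalty proportional to the compute spent on it, so slower thinking is disfavored without needing many short episodes.

```python
# Minimal sketch of a per-decision speed prior within a single long episode.
# All names (speed_penalized_return, think_steps, etc.) are hypothetical.

def speed_penalized_return(episode, penalty_per_step=0.01):
    """Total reward minus a penalty on thinking time per decision.

    `episode` is a list of (reward, think_steps) pairs, one per move:
    the reward earned at that move and the compute spent deciding it.
    """
    total = 0.0
    for reward, think_steps in episode:
        total += reward - penalty_per_step * think_steps
    return total

# Example: three moves with rewards 1.0, 0.0, 1.0 and thinking
# budgets of 10, 50, and 20 steps.
episode = [(1.0, 10), (0.0, 50), (1.0, 20)]
print(speed_penalized_return(episode))  # 2.0 - 0.01 * 80 = 1.2
```

The same shape works for the other cases by swapping the denominator of the penalty: simulation time per unit simulated time, or thinking time per predicted word.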
I think I may be confused about the argument being made in the ‘Deceptively Aligned Models’ section, and am restating my understanding here to see if you agree. [And if not, clarification on what I’ve got wrong would be very helpful!]
I think I understand the previous two sections:
- Models that converge to internally aligned states do so very slowly, because as they become more internally aligned it gets less and less likely that they encounter examples which differentiate between the proxy and base objectives.
- Models that converge to corrigibly aligned states do so very slowly, because as their pointers to the base objective become better it gets less and less likely that they encounter examples which can shift the pointer towards the base objective.
My best attempt to restate the argument in the deceptively aligned models section is something like:
1. Models that converge to deceptively aligned states at some point notice they’re in training, and this happens before, e.g., a model can converge to a corrigibly aligned state.
2. At that point, SGD pushes them to better model the training process, because that helps them perform well on the base objective during training. So they learn deception.
3. Models that learn deception learn the pointer to the base objective at runtime rather than via SGD. To the extent that the models are able to build powerful optimization processes, this might be more efficient than SGD.
4. Assuming the above, models that learn deception manage to learn the pointer to the base objective faster than models that converge to corrigible states do, and faster than internally aligned models converge on a model of the base objective proper.
5. As a result, starting from a random initialization, the first state you hit on is likely to be a deceptive one.
Is that right?
If it is, one possible issue is that a lot of work is being done by two pieces:
(1) It is easier to learn a pointer to the base objective at runtime than during training.
(2) Deceptive alignment, unlike internal or corrigible alignment, allows learning a pointer at runtime, so (1) favors deception.
I agree that (1) is likely, but (2) is less clear. I think a model could have a proxy objective of “learn the base objective at runtime and follow that”, and so be corrigibly aligned while still getting the benefits of runtime learning. A counter-counterpoint is that this is an unlikely proxy objective to have learned early in training, and I’m not sure how to think about that yet...