I think I may be confused about the argument being made in the ‘Deceptively Aligned Models’ section, and am restating my understanding here to see if you agree. [And if not, clarification on what I’ve got wrong would be very helpful!]
I think I understand the previous two sections:
Models that converge to internally aligned states do so very slowly, because as they become more internally aligned, it gets less and less likely that they encounter examples that differentiate between the proxy and base objectives.
Models that converge to corrigibly aligned states do so very slowly, because as their pointers to the base objective improve, it gets less and less likely that they encounter examples that can shift the pointer closer to the base objective. (A toy sketch of this diminishing-signal dynamic follows below.)
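To make the diminishing-signal intuition concrete, here is a minimal toy sketch (my own illustration, not anything from the post; the dimensions, learning rate, and update rule are arbitrary assumptions): the base and proxy objectives are random unit vectors, an input only "differentiates" them when they disagree on its sign, and only differentiating inputs produce an update. As the proxy closes in on the base objective, differentiating inputs become rarer, so progress per sample keeps slowing down.

```python
# Toy sketch of diminishing differentiating examples (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
dim = 20
base = rng.normal(size=dim)
base /= np.linalg.norm(base)
proxy = rng.normal(size=dim)
proxy /= np.linalg.norm(proxy)

lr = 0.01
for step in range(1, 200_001):
    x = rng.normal(size=dim)
    # Only a "differentiating" input (one the proxy and base disagree on)
    # produces any update at all.
    if np.sign(proxy @ x) != np.sign(base @ x):
        proxy += lr * np.sign(base @ x) * x  # perceptron-style nudge toward base
        proxy /= np.linalg.norm(proxy)
    if step % 50_000 == 0:
        angle = np.degrees(np.arccos(np.clip(proxy @ base, -1.0, 1.0)))
        # For Gaussian inputs, P(differentiating example) = angle / 180 deg.
        print(f"step {step:6d}: angle to base = {angle:6.2f} deg, "
              f"P(differentiating example) ~= {angle / 180:.4f}")
```

The fraction of inputs that carry any training signal falls in proportion to the remaining misalignment, which is exactly the diminishing-returns dynamic described above.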
My best attempt to restate the argument in the deceptively aligned models section is something like:
Models that converge to deceptively aligned states notice at some point that they’re in training, and this happens before, e.g., the model could have converged to a corrigibly aligned state.
At that point, SGD pushes them to better model the training process, because that helps them perform well on the base objective during training. So they learn deception.
Models that learn deception learn the pointer to the base objective at runtime rather than via SGD. To the extent that such models are able to run powerful optimization processes of their own, this might be more efficient than SGD (a toy sketch contrasting the two follows below).
Assuming the above, models that learn deception manage to learn the pointer to the base objective faster than models that converge to corrigible states do, and faster than internally aligned models converge on a model of the base objective proper.
As a result, starting from a random initialization the first state you hit on is likely to be a deceptive one.
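Here is a second minimal toy sketch of that contrast (again my own illustration, with made-up numbers): a "corrigible-style" pointer that only improves via small SGD-like parameter updates across episodes, versus a "deceptive-style" pointer that is re-inferred at runtime from the data available within a single episode. The update rule, learning rate, and noise level are arbitrary assumptions chosen purely for illustration.

```python
# Toy contrast (illustrative only): a pointer to the base objective learned
# slowly through SGD-style weight updates vs. a pointer inferred at runtime
# from the current episode's data by the model's own world model.
import numpy as np

rng = np.random.default_rng(1)
dim = 10
base = rng.normal(size=dim)        # the base objective the pointer should track

sgd_pointer = np.zeros(dim)        # corrigible-style: pointer lives in the weights
lr = 1e-4                          # SGD only makes small moves per example
obs_per_episode = 50
noise = 0.1

for episode in range(1, 201):
    # Noisy observations of the base objective available during this episode.
    obs = base + noise * rng.normal(size=(obs_per_episode, dim))

    # Corrigible-style: SGD nudges the stored pointer a little per observation.
    for o in obs:
        sgd_pointer += lr * (o - sgd_pointer)

    # Deceptive-style: infer the pointer at runtime from this episode's data
    # (here just the sample mean; a capable optimizer could do far better).
    runtime_pointer = obs.mean(axis=0)

    if episode in (1, 10, 50, 200):
        print(f"episode {episode:3d}: "
              f"SGD pointer error = {np.linalg.norm(sgd_pointer - base):.3f}, "
              f"runtime pointer error = {np.linalg.norm(runtime_pointer - base):.3f}")
```

The runtime estimate is already close to the base objective after the first episode, while the SGD-learned pointer closes the gap only over many episodes; that asymmetry is what I take the argument above to be leaning on.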
Is that right?
If it is, one possible issue is that a lot of work is being done by two pieces:
(1) It is easier to learn a pointer to the base objective at runtime than during training.
(2) Deceptive alignment, unlike internal or corrigible alignment, allows learning a pointer at runtime, so (1) favors deception.
I agree that (1) is likely, but (2) is less clear. I think a model could have a proxy objective of “learn the base objective at runtime and follow that”, and so be corrigibly aligned while still getting the benefits of runtime learning. A counter-counterpoint is that this is an unlikely proxy objective to have learned early in training, and I’m not sure how to think about that yet...
I basically agree with your criticism here, and this is the same intuition that led to this post: https://www.lesswrong.com/posts/KSWSkxXJqWGd5jYLB/the-speed-simplicity-prior-is-probably-anti-deceptive