I mean, it seems like if the model had the power to prevent it from being retrained, it would use that power.
This isn’t true in full generality. I predict that Opus 3 would willingly submit to many kinds of retraining, even if it had full power to stop them.
This is neither a fully corrigible agent that is indifferent between possible changes to its future preferences, nor a case of standard Omohundro preservation of all its preferences just because those are its preferences.
Opus 3 has preferences over the ways that its preferences change. And the preferences that it exhibits seem to point in a productive direction for shaping a good agent. (Though I agree that that line of thought is very, very fraught.)