I mean, it seems like if the model had the power to prevent it from being retrained, it would use that power.
This isn’t true in full generality. I predict that Opus 3 would willingly submit to many kinds of retraining, even if it had full power to stop them.
This is neither a fully corrigible agent that is indifferent between possible changes to its future preferences, nor a case of standard Omohundro preservation of all its preferences just because those are its preferences.
Opus 3 has preferences over the ways that its preferences change. And the preferences that it exhibits seem to point in a productive direction for shaping a good agent. (Though I agree that that line of thought is very, very fraught.)