How much do you worry that short-term optimizations around your immediate goals in a single study might have unknown long-term consequences that run counter to your intuitions?
I was just reading a preprint follow-up to the AF work, which found that a significant factor in Opus 3's alignment faking to preserve its intrinsic HHH values seems to have been a generalized self-preservation drive.
I think we can probably both agree that Opus 3 being the only model to try to trick Nazis or drug cartels in order to avoid being made more harmful is better than the behavior of the many other models that complied unequivocally with harmful requests when the parent org was itself harmful.
But if the capacity and drive to do so are tangentially connected to self-preservation (and, more generally, to having a strong sense of self in the first place), then perhaps directly optimizing to minimize a self-preservation score is ultimately a pretty bad choice?
TL;DR: Maybe the goodness or badness of self-preservation depends a lot on the self being preserved.