I have been ruminating about this for a couple of days, and have settled on what feels like a stupid question, in the hope of becoming less stupid:
Why isn’t the capability to distinguish between evals and reality the crux of the problem?
It seems like this ability to distinguish between the two worlds is at least sufficient for the model to deceptively maintain misaligned goals, even if it isn’t necessary.