I have been ruminating about this for a couple of days, and have settled on what feels like a stupid question, in the hope of becoming less stupid:
Why isn’t the capability to distinguish between evals and reality the crux of the problem?
It seems like this ability to distinguish between the two worlds is at least sufficient for the model to deceptively maintain misaligned goals, even if it isn’t necessary.