To put it another way, whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying in these cases”, that is the threat model.
Sure. But knowing the details of how it’s happening under the hood—whether it’s something that the model in some sense intentionally chooses or not—seems important for figuring out how to avoid it in the future.