OK, good points re: those examples. But can you come up with an example that’s more analogous to the cases with Claude? E.g. you are given a difficult task by your boss, and you are considering cheating a bit and pretending you did the task straightforwardly when really you cheated… and then you learn that this is all a test of your ethics. And so you decide not to cheat. Is there a benign explanation of that pattern?
OK, good points re: those examples. But can you come up with an example that’s more analogous to the cases with Claude? E.g. you are given a difficult task by your boss, and you are considering cheating a bit and pretending you did the task straightforwardly when really you cheated… and then you learn that this is all a test of your ethics. And so you decide not to cheat. Is there a benign explanation of that pattern?
From the published cases, I could not find a clear example of Sonnet 4.5 following that particular pattern, but maybe I missed it. Could you share?
I made it up, sorry, it seemed like roughly the sort of thing that was happening? I should pick an example from the actual cases.
For what it’s worth, I would expect the behavior you described, and suspect a malicious explanation overall.
Anecdotally, Claude lies/cheats an enormous amount (far more than comparable frontier models).