Daniel Kokotajlo comments on Daniel Kokotajlo’s Shortform

Daniel Kokotajlo 3 Oct 2025 0:06 UTC
LW: 3 AF: 2
0
AF
OK, good points re: those examples. But can you come up with an example that’s more analogous to the cases with Claude? E.g. you are given a difficult task by your boss, and you are considering cheating a bit and pretending you did the task straightforwardly when really you cheated… and then you learn that this is all a test of your ethics. And so you decide not to cheat. Is there a benign explanation of that pattern?
- sam b 3 Oct 2025 0:49 UTC
  1 point
  0
  From the published cases, I could not find a clear example of Sonnet 4.5 following that particular pattern, but maybe I missed it. Could you share?
  - Daniel Kokotajlo 3 Oct 2025 0:56 UTC
    4 points
    1
    I made it up, sorry. It seemed like roughly the sort of thing that was happening? I should pick an example from the actual cases.
    - sam b 3 Oct 2025 1:35 UTC
      1 point
      0
      For what it’s worth, I would expect the behavior you described, and suspect a malicious explanation overall.
      Anecdotally, Claude lies/cheats an enormous amount (far more than comparable frontier models).