Neel Nanda comments on Daniel Kokotajlo’s Shortform

Neel Nanda 2 Oct 2025 23:51 UTC
LW: 5 AF: 4
0
I disagree. Some analogies: If someone tells me, “By the way, this is a trick question,” I’m a lot better at answering trick questions. If someone tells me the maths problem I’m currently trying to solve was put on a maths Olympiad, that gives me a lot of useful information, such as that it’s going to have a reasonably nice solution, which makes it much easier to solve. And if someone’s tricking me into doing something, then being told, “By the way, you’re being tested right now,” is extremely helpful, because I’ll pay a lot more attention and be a lot more critical, and I might notice the trick.
- Daniel Kokotajlo 3 Oct 2025 0:06 UTC
  LW: 2 AF: 2
  0
  OK, good points re: those examples. But can you come up with an example that’s more analogous to the cases with Claude? E.g. you are given a difficult task by your boss, and you are considering cheating a bit and pretending you did the task straightforwardly when really you cheated… and then you learn that this is all a test of your ethics. And so you decide not to cheat. Is there a benign explanation of that pattern?
  - sam b 3 Oct 2025 0:49 UTC
    1 point
    0
    From the published cases, I could not find a clear example of Sonnet 4.5 following that particular pattern, but maybe I missed it. Could you share?
    - Daniel Kokotajlo 3 Oct 2025 0:56 UTC
      3 points
      1
      I made it up, sorry; it seemed like roughly the sort of thing that was happening. I should pick an example from the actual cases.
      - sam b 3 Oct 2025 1:35 UTC
        1 point
        0
        For what it’s worth, I would expect the behavior you described, and suspect a malicious explanation overall.
        Anecdotally, Claude lies/cheats an enormous amount (far more than comparable frontier models).