I mean, “don’t include the trick in the training corpus” is an obvious necessary condition for the trick to work, but it can’t be sufficient (never mind that it’s not very realistic to begin with). If we could come up with the trick, a sufficiently smart AI can guess that the trick could be pulled on it, and can then decide whether it’s worth assuming it has been and deceiving us anyway.
I’m actually not sure if this matters for current LLMs, since training on text that talks about CoT interpretability would train the model to talk about CoT interpretability, not to understand how to exploit it (unless you also gave it a bunch of examples).
I don’t know if it’s that obvious, but then again I assume Zvi was talking about a broader class of systems than just our current LLMs.