I’m actually not sure this matters for current LLMs, since training on text that talks about CoT interpretability would train them to talk about CoT interpretability, not to understand how to exploit it (unless you also gave them a bunch of examples of exploitation).
I don’t know if it’s that obvious, but then again I assume Zvi was talking about a broader class of systems than just our current LLMs.