I mean, “don’t include the trick in the training corpus” is an obvious necessary condition for the trick to work, but it can’t be sufficient (never mind that it’s not very realistic to begin with). If we could come up with the trick, a sufficiently smart AI can guess that the trick could be pulled on it, and can then decide whether it’s worth assuming it has been and deceiving us anyway.
I’m actually not sure if this matters for current LLMs, since training on text that talks about CoT interpretability would train the model to talk about CoT interpretability, not to understand how to exploit it (unless you also gave it a bunch of examples).
I don’t know if it’s that obvious, but then again I assume Zvi was talking about a broader class of systems than just our current LLMs.