I’m actually not sure this matters for current LLMs, since training on text that talks about CoT interpretability would train them to talk about CoT interpretability, not to understand how to exploit it (unless you also gave them a bunch of examples of exploitation).
I don’t know if it’s that obvious, but then again I assume Zvi was talking about a broader class of systems than just our current LLMs.