For the toy model, I’d like it if you explicitly showed that there are inputs on which the CLT replacement model gives very different outputs to the original model. You could then check whether the PLT replacement model performs better than the CLT one on these inputs. This result should be simple to obtain: just choose a non-binary input, i.e. an (a, b, c, d) that is not in {−1, 1}^4. But it would make the toy model more compelling to me, since, to me, faithfulness is really about OOD generalization.
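Concretely, something like the following sketch is what I have in mind. It assumes you can call the original toy model and its replacements as functions of (a, b, c, d); the names `original_model`, `clt_replacement`, and `plt_replacement` are placeholders for whatever your actual code exposes, not references to it.

```python
# Minimal sketch of the proposed OOD check: sample inputs off the training
# set {-1, 1}^4 and measure how far each replacement model's output drifts
# from the original model's output.
import numpy as np

def ood_discrepancy(original_model, replacement_model, n_samples=1000, seed=0):
    """Mean |original - replacement| output gap on non-binary inputs."""
    rng = np.random.default_rng(seed)
    # Uniform samples from [-1, 1]^4 are almost surely not in {-1, 1}^4,
    # so they are off-distribution for the toy model's training data.
    inputs = rng.uniform(-1.0, 1.0, size=(n_samples, 4))
    gaps = [abs(original_model(x) - replacement_model(x)) for x in inputs]
    return float(np.mean(gaps))

# Hypothetical usage, once the toy model and its replacements are available:
# clt_gap = ood_discrepancy(original_model, clt_replacement)
# plt_gap = ood_discrepancy(original_model, plt_replacement)
# print(f"CLT gap: {clt_gap:.3f}, PLT gap: {plt_gap:.3f}")
```

If the PLT gap is consistently smaller than the CLT gap on these inputs, that would be the kind of evidence I'd find compelling.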
Thanks for your thoughtful comment! I think that’s an interesting idea. If a replacement model is faithful, you would indeed expect it to behave similarly when tested on other data. This may connect to a related finding that we couldn’t yet explain: CLTs fail on memorized sequences, even though those should be the simplest cases. Maybe the OOD test is something we could run on real LLMs as well, not just toy models.
I don’t think I’d go as far as saying “faithfulness == OOD generalization”, though. I think it is possible for two different circuits to produce the same behavior, even OOD. For example, suppose models A and B both learn to multiply arbitrary numbers perfectly. There are different ways to compute x * y: add x to itself y times, do long multiplication the way we learned in school, etc. In those cases, OOD examples would still yield identical outputs, but the internal circuits are different. So I’d say that faithfulness is a stronger claim than OOD generalization.
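To make the point concrete, here is a toy illustration (my own, not part of the papers under discussion): two procedures that agree on every input, including ones far outside any plausible training range, while running entirely different internal algorithms.

```python
def multiply_repeated_addition(x: int, y: int) -> int:
    """Compute x * y by adding x to itself |y| times."""
    total = 0
    for _ in range(abs(y)):
        total += x
    return total if y >= 0 else -total

def multiply_shift_and_add(x: int, y: int) -> int:
    """Compute x * y by binary shift-and-add, a different internal algorithm."""
    sign = -1 if y < 0 else 1
    y = abs(y)
    total = 0
    while y:
        if y & 1:
            total += x
        x <<= 1
        y >>= 1
    return sign * total

# Identical input/output behavior everywhere, yet distinct "circuits" inside.
assert all(
    multiply_repeated_addition(x, y) == multiply_shift_and_add(x, y)
    for x in range(-20, 21) for y in range(-20, 21)
)
```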
I won’t argue about definitions of faithfulness (yours seems as fine as any), but I think I only care about unfaithfulness insofar as it causes our circuits to make incorrect predictions about OOD behavior.