I’m not sure what you mean here by “circuit alignment”. Shallow circuits in the limit allow almost every possible behavior (imagine that every circuit is literally a gate in a Boolean-circuit-complete architecture; you can combine them into arbitrary programs). To the extent that alignment is happening here, I would expect it to be very in-distribution?
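For concreteness, here’s a toy sketch of the Boolean-completeness point (the helper names and the adder example are mine, just illustrating that one shallow primitive composes into arbitrary programs):

```python
# NAND alone is Boolean-complete, so a big enough stock of shallow
# circuits can be wired together into arbitrary programs.

def nand(a: bool, b: bool) -> bool:
    return not (a and b)

# One shallow primitive composes into richer gates...
def not_(a):    return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b):  return nand(not_(a), not_(b))
def xor_(a, b): return and_(or_(a, b), nand(a, b))

# ...and gates compose into a program: a one-bit full adder.
def full_adder(a, b, carry_in):
    s1 = xor_(a, b)
    return xor_(s1, carry_in), or_(and_(a, b), and_(s1, carry_in))  # (sum, carry_out)

assert full_adder(True, True, False) == (False, True)
```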
Another piece of bad news is that in this scenario we can’t partially decipher LLM cognitive algorithms and use them to build legible AI, because what we basically have here is GOFAI: “if only we had enough labor to encode it all, and not too much perfectionism to fret about shallow features”.
How does your model account for non-trivial features like the ones found [here](https://arxiv.org/abs/2502.00873v1)?
I think that in reality there is some deep generalization happening, but by default “neural networks are lazy”, so the majority of out-of-distribution work is done by an enormous number of shallow circuits.
Speculation: episodic memory is a decluttering of shallow circuits to free up space for more generalized learning.
I think that’s explainable by the fact that modular arithmetic is not very complicated. By deep generalization I mean something like “a semantically rich world model encoded in a relatively small number of circuits”. You can have a world model encoded in a large number of shallow circuits, but I think my point about the resulting in-distribution-only alignment probably stands.
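To show why I say “not very complicated”: the usual “clock” account of how networks do modular addition is that residues get embedded as angles on the unit circle, and adding residues becomes adding angles. A minimal sketch (the modulus and helper names are my own choices for illustration, not anything from the linked paper):

```python
import numpy as np

p = 113  # an arbitrary prime modulus, chosen just for this sketch

def embed(a: int, k: int = 1) -> np.ndarray:
    """Embed residue a as a point on the unit circle (single Fourier frequency k)."""
    theta = 2 * np.pi * k * a / p
    return np.array([np.cos(theta), np.sin(theta)])

def add_mod(a: int, b: int, k: int = 1) -> int:
    """Compute (a + b) mod p by rotating embeddings: angle addition = modular addition."""
    ca, sa = embed(a, k)
    cb, sb = embed(b, k)
    # Angle-addition identities give cos(x + y) and sin(x + y).
    c, s = ca * cb - sa * sb, sa * cb + ca * sb
    # Decode: pick the residue whose embedding best matches the rotated point.
    residues = np.arange(p)
    logits = c * np.cos(2 * np.pi * k * residues / p) + s * np.sin(2 * np.pi * k * residues / p)
    return int(np.argmax(logits))

assert add_mod(57, 98) == (57 + 98) % p
```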