I’ve been thinking about neuralese recently, and I’m wondering if there’s more of a tradeoff here with interpretability. If we actually could train a model to use neuralese[1], we might be able to make it much smaller and easier to interpret (since it wouldn’t need multiple layers to express long-term dependencies). This would make the tokenized output less interpretable, but it would potentially remove even-less-interpretable inner layers. What do you think?

[1] I’m actually very skeptical that this is possible without a major breakthrough or some very expensive data.
This is an interesting point. Off-the-cuff take:

I’m kind of bearish on mechanistic interpretability giving us very complete and satisfying answers about what’s going on in models. Neel Nanda also seems to believe this. For the kind of stuff we can do today, like training SAEs or probes, it seems like the difficulty isn’t much affected by model size. My vague impression is that scaling up to larger models is bottlenecked primarily on compute, and not that much compute at that.
Maybe if ambitious interpretability works out, a smaller model will be easier to interpret? But the effect doesn’t obviously seem that large to me. Maybe if neuralese lets you build a model that’s 2x smaller, that would make it more than 2x easier to interpret.
Also, it’s unclear to me how much smaller you could make the model with neuralese, if at all. (Some) long-term dependencies can already be expressed in the CoT. Do many layers end up being required to express certain otherwise-inarticulable long-term dependencies? It seems plausible to me that most useful inarticulable thoughts only require short-term cognition, though the part of the post about “picturing a geometric object and turning it around in one’s mind for minutes at a time” might be a counterexample.
If it’s possible for the model to reuse the same parameters again and again to improve its own understanding, maybe it’s already doing this within its CoT by doing something like “thinking dot by dot”. Actually, I started writing this and realized it probably proves too much, because it would mean that continuous thought is never better than thinking dot by dot. I’m showing my ignorance of the details of the transformer architecture here, but this is supposed to be off the cuff, so I’ll leave it here for now.
Thanks for the response! That makes a lot of sense.
My intuition is that making the model less deep would make it non-linearly easier to interpret, since the last layers of the model are the most interpretable (they’re closest to human-understandable tokens and embeddings), while the earlier layers are less interpretable (they’re farther away). Also, if layers end up doing duplicate work, deduplicating them could make the model easier to interpret. I’m not sure though.
For “thinking dot by dot”, I think serial computation can only benefit from roughly as many filler tokens as you have layers, since each layer can only attend to the outputs of earlier layers at previous positions. For example, in a two-layer model, position i does some calculations in layer 1; position i+1 can attend to the layer-1 results from position i in its layer 2; but at position i+2 there’s no layer 3 to attend to the layer-2 outputs from position i+1. Actually, I now think LLMs can’t use filler tokens for sequential reasoning at all.

Note how the only difference between position i and i+1 is an additional “.” input and the positional encoding.
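Here’s a quick toy sketch of the depth bound, using only the simplified picture from the example above (layer k at position j can only read layer k−1 results at positions ≤ j), not the real transformer math: however many filler positions you append, the longest chain of dependent computation steps never gets past the number of layers.

```python
# Toy model of information flow in a causally-masked transformer.
# This is a deliberate simplification matching the example above, not a claim
# about the exact architecture.

def max_serial_depth(n_layers: int, n_positions: int) -> int:
    # depth[k][j] = length of the longest chain of dependent computation steps
    # that can feed into the layer-k activation at position j (row 0 = embeddings)
    depth = [[0] * n_positions for _ in range(n_layers + 1)]
    for layer in range(1, n_layers + 1):
        for pos in range(n_positions):
            # read layer-(k-1) results at any position <= pos, then do one more step
            depth[layer][pos] = max(depth[layer - 1][: pos + 1]) + 1
    return max(depth[n_layers])

for n_filler in (0, 10, 1000):
    print(n_filler, max_serial_depth(n_layers=12, n_positions=1 + n_filler))
# prints 12 every time: extra filler positions add width, not serial depth
```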
The dot-by-dot paper uses a highly parallel algorithm, so I think it’s more like adding dynamic attention heads / increasing the width of the network, while CoT is more like increasing the number of layers.
If I understand the architecture right (I might not), standard models are already effectively doing neuralese CoT, but bounded by the depth of the network (and also with the caveat that the model has fewer layers remaining to use the results after each step).
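To put the contrast with CoT in the same toy terms (same simplifying assumptions as the sketch above, so the exact numbers are only illustrative): a sampled CoT token depends on the full stack and then re-enters at the bottom, so the dependency chain grows with every generated token instead of being capped by depth.

```python
def serial_depth_with_cot(n_layers: int, n_generated: int) -> int:
    # Each generated token runs the whole stack once (n_layers steps), then the
    # sampled token re-enters as an input at the next position (+1 step).
    return n_generated * (n_layers + 1)

print(serial_depth_with_cot(n_layers=12, n_generated=1000))  # 13000
# versus the filler-token case above, which stays at 12 no matter how many
# dots you append; CoT behaves like adding layers, fillers like adding width
```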
Edit: I’m working on a longer post about this, but I actually think the dot-by-dot paper is evidence against LLMs being able to do most kinds of reasoning with filler tokens.
Thanks for the explanation of how thinking dot by dot works! That makes sense.
I actually think the dot-by-dot paper is evidence against LLMs being able to do most kinds of reasoning with filler tokens
Yeah, I’m just reading the abstract of the paper, but it looks like it required very “specific” training on their problem to get it to work. I guess this is why you think it’s evidence against? But maybe with algorithmic improvements, or due to a CoT where the only goal is getting higher reward rather than looking good to the user, LLMs will naturally gain the ability to use filler tokens effectively.
My intuition is that making the model less deep would make it non-linearly easier to interpret, since the last layers of the model are the most interpretable (they’re closest to human-understandable tokens and embeddings), while the earlier layers are less interpretable (they’re farther away).
Well, if some reasoning is inarticulable in human language, the circuits implementing that reasoning would probably be difficult to interpret regardless of which layer of the model they appear in. If there’s reasoning that is articulable in human language, then presumably the model could just write that reasoning in CoT tokens. Steelman: certain reasoning is almost inarticulable (such that it’s impossible to entirely represent it in CoT), but nonetheless pretty “close” to being articulable. If you reduce the number of layers, the circuit for that reasoning will end up reusing some of the human-language circuitry, whereas otherwise the whole circuit would have been completely inarticulable.
The best reason I have for why reducing model size could nonlinearly improve interpretability is that there are dependencies between different parts of the network, so the work needed to understand all of these dependencies grows like O(n^2) or worse.
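A toy way to see the nonlinearity (made-up component counts, and assuming pairwise interactions dominate the cost, which is just the O(n^2) picture above): halving the number of interacting parts roughly quarters the number of pairs you’d need to check.

```python
def n_pairs(n_components: int) -> int:
    # number of unordered pairs of components that could interact
    return n_components * (n_components - 1) // 2

for n in (10_000, 5_000):  # hypothetical circuit/feature counts
    print(n, n_pairs(n))
# 10000 -> 49995000 pairs; 5000 -> 12497500 pairs (about 4x fewer for a 2x smaller model)
```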
Anyway, this still seems pretty speculative. I think this argument would probably affect my actions very little or not at all—like, I still think going from CoT to neuralese would be bad and would want to discourage it. But if I had more understanding of mech interp I would feel less uncertain about it.
The post I was working on is out, “Filler tokens don’t allow sequential reasoning”. I think LLMs are fundamentally incapable of using filler tokens for sequential reasoning (do x and then based on that do y). I also think LLMs are unlikely to stumble upon the algorithm they came up with in the paper through RL.
Well, if some reasoning is inarticulable in human language, the circuits implementing that reasoning would probably be difficult to interpret regardless of which layer of the model they appear in.
This is a good point, and I think I’ve been missing part of the tradeoff here. If we force the model to output human-understandable concepts at every step, we encourage its thinking to be human-understandable. Removing that incentive would plausibly make the model smaller, but that additional complexity is the model doing interpretability for us (both in making its thinking closer to human thinking and in building a pipeline to expose those thoughts as words).
Thanks for the back-and-forth on this, it’s been very helpful!