This post shows that inactive circuits erroneously activate in practice, violating that assumption. I’m curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
The erroneous activation only happens if the errors get large enough. With large enough T, D, and S, this should be avoidable.
Agnostic of the method for embedding the small circuits in the larger network: currently only 1 out of d neurons in each small network is allocated to storing whether the small network is on or off. I’m suggesting increasing that to cd neurons, for some small fixed c, which increases the size of the small networks to (1+c)d neurons. In the rotation example, d is so small that this doesn’t really make sense, but I’m just thinking asymptotically. This should generalise straightforwardly to the “cross-circuit” computation case as well.
Just to clarify, the current circuit uses 2 small-circuit neurons to embed the rotating vector (since it’s a 2-dimensional vector) and 2 small-circuit neurons for the on-indicator (in order to compute a step function, which requires 2 ReLUs).
We could allocate more of the total storage to on-indicator information and less to the rotated vector. Such a shift may well be an improvement.
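Schematically, the two-ReLU step function looks like this (a minimal numpy sketch with made-up sharpness and threshold, not the exact weights from the construction):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_step(x, sharpness=100.0, threshold=0.5):
    # Difference of two ReLUs: 0 below the threshold, 1 above it,
    # with a ramp of width 1/sharpness in between.
    return relu(sharpness * (x - threshold)) - relu(sharpness * (x - threshold) - 1.0)

print(soft_step(np.array([0.0, 0.49, 0.51, 1.0])))  # ~[0, 0, 1, 1]
```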
The idea is that while each of the cd indicator neurons would be the same as each other in the smaller network, when embedded in the larger network, the noise each small network neuron (distributed across S neurons in the large network) receives is hopefully independent.
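Here is a toy simulation of the hoped-for effect (the noise model and numbers are made up, just to illustrate why independence across the copies would matter):

```python
import numpy as np

rng = np.random.default_rng(0)
trials, copies = 100_000, 8        # "copies" plays the role of the cd redundant indicator neurons
true_value, noise_std = 1.0, 0.6   # circuit is on; per-copy readout noise (illustrative)

# Each indicator copy is read out with (hopefully) independent noise.
readouts = true_value + rng.normal(0.0, noise_std, size=(trials, copies))

single_copy_error = np.mean(readouts[:, 0] < 0.5)                   # threshold one copy at 0.5
majority_vote_error = np.mean((readouts > 0.5).mean(axis=1) < 0.5)  # majority vote over all copies

print(f"single copy error rate:   {single_copy_error:.3f}")   # ~0.20
print(f"majority vote error rate: {majority_vote_error:.4f}") # ~0.01, but only if the noise really is independent
```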
The total noise is already a sum of several independent components. Increasing S makes the noise smaller, which is better. Your method would not make the noise contributions any more independent.
The limit on S is that we don’t want any two small-network neurons to share more than one large-network neuron. We have an allocation algorithm that is better than just random assignment, using prime-number steps, but if I make S too large the algorithm runs out of prime numbers, which is why S isn’t larger. This is less of a problem for larger D, so I think that in the large-D limit this would not be a problem.
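To illustrate the kind of prime-based allocation I mean, here is a simplified sketch (a textbook lines-over-Z_p construction, not exactly our algorithm) that gets the “share at most one large-network neuron” property and shows why S is capped:

```python
import itertools

p = 11   # a prime; large-network neurons are indexed by pairs (x, y) in Z_p x Z_p, so D = p^2 = 121
S = 5    # each small-network neuron is spread over S large-network neurons; this needs S <= p

def allocation(slope, intercept):
    # The small-network neuron labelled (slope, intercept) uses the points
    # (x, slope*x + intercept mod p) of a line over Z_p, flattened to indices in [0, p^2).
    return frozenset(x * p + (slope * x + intercept) % p for x in range(S))

small_neurons = [allocation(a, b) for a in range(p) for b in range(p)]  # p^2 small-network neurons

# Two distinct lines over Z_p intersect in at most one point, so any two
# small-network neurons share at most one large-network neuron.
worst_overlap = max(len(u & v) for u, v in itertools.combinations(small_neurons, 2))
print("max shared large-network neurons:", worst_overlap)  # prints 1
```

A larger D gives you a larger prime to work with, so S can grow while keeping the overlap at one, which is the sense in which this is less of a problem in the large-D limit.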
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are at most z. Even if you erroneously believe one of those z circuits is still active on a later layer after it has turned off, the gain comes from eliminating the other T−z inactive circuits. If the on-indicators don’t seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I’m not sure I understand what you’re trying to say, but I’ll try to respond anyway.
In our setup it is known from the start which circuits are going to be active for the rest of the forward pass. This is not one of the assumptions we listed, but it is implicit in the entire framework (there will always be cases like that even if you try to list all assumptions). However, this “built-in assumption” is just there for convenience, not because we think it is realistic. I expect that in a real network, which circuits are used in the later layers will depend on computations in the earlier layers.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
I agree shared state/cross-circuit computation is an important thing to model, though. I guess that’s what you mean by “more generally”? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice. But it seems like you’re just saying that if we want this to work more generally, there are issues?
There is a regime where the (updated) framework works. See figures 8-11 for values T(d/D)^2 < 0.004. However, for sizes of networks I can run on my laptop, that does not leave room for very much superposition.
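To spell out the laptop-scale point, here is a rough back-of-the-envelope check, taking T·d > D (more small-network neurons than large-network neurons) as a crude stand-in for “superposition”; that criterion is my simplification, not the definition used in the post:

```python
d = 2
for D in [100, 500, 1000]:
    T_max = 0.004 * (D / d) ** 2   # largest T in the regime from figures 8-11
    # Crude criterion: superposition needs more small-network neurons than large-network neurons.
    print(f"D={D}: T_max={T_max:.0f}, T_max*d > D: {T_max * d > D}")
# T_max*d > D only once D/d exceeds 250, so small (laptop-sized) D leaves little room for superposition.
```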
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.
Do you want to have a call some time in January? There are probably lots of things that aren’t explained as well as they could have been in the post.
We could allocate more of the total storage to on-indicator information and less to the rotated vector. Such a shift may well be an improvement.
This is all I mean. Having the small circuits do error correction on the on-indicator neurons is just a way of increasing the total percentage of storage allocated to on-indicators in the larger network, in an embedding-agnostic manner. You can change your method for constructing the larger network later, and this strategy would be orthogonal to such a change.
I think allocating a higher percentage to on-indicators should still apply even when adding cross-circuit computation.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
The assumption “softmax will make sure the network just focuses on a few previous words (right?)” is true for many attention heads, but not all of them. Some attention heads attend broadly across the whole sequence, aggregating many different tokens together to get the ‘gist’ of a text.
By the end of the first layer of GPT-2 Small, it has constructed a continuous linear summary of the previous 50 tokens, and so has a Word2Vec-style vector in its residual stream. So it knows that the text is about World War I, or about AI, that it is written in British English, whether it is formal or informal, etc., all by the end of the first attention layer (before the first MLP). That is a lot of information to go on in terms of turning off circuits and so on (think Mixture-of-Experts).
There is a regime where the (updated) framework works. See figures 8-11 for values T(d/D)^2 < 0.004. However, for sizes of networks I can run on my laptop, that does not leave room for very much superposition.
Nice, OK. So asymptotically it works fine, then? So the next step theory-wise is having a framework that allows for cross-circuit computation, I guess.
Do you want to have a call some time in January? There are probably lots of things that aren’t explained as well as they could have been in the post.
I would be grateful if you get the time. I have recently had some more ideas about this stuff, potentially more useful ones, that might be worth discussing.
OK, I have a neat construction for z=1, https://www.lesswrong.com/posts/g9uMJkcWj8jQDjybb/ping-pong-computation-in-superposition, that works pretty well (T = D^2/d^2, with width D(1+2d) and L+3 layers) and zero error. Note that D^2/d^2 is exact here, not asymptotic.
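Plugging in small numbers to make the scaling concrete (assuming I have the formula right, and taking L to be the depth of the small circuits; the values are just illustrative):

```python
L = 4  # illustrative depth of the small circuits
for d, D in [(2, 20), (2, 100)]:
    T = (D / d) ** 2          # number of small circuits handled (exact, not asymptotic)
    width = D * (1 + 2 * d)   # width of the big network
    print(f"d={d}, D={D}: T={T:.0f}, width={width}, depth={L + 3}")
# e.g. d=2, D=100: 2500 width-2 small circuits, i.e. T*d = 5000 small-network
# neurons, packed into a big network of width 500.
```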
I’m trying to find a similar construction that scales like z in the number of parameters required, without just scaling the number of layers up. I’d also be curious if it’s possible to avoid scaling parameters linearly with z, but it seems quite difficult.