it’s all there for layer n+1’s attention to process, though. each new token appended to the end is the marginal new computation result produced by the previous token position’s forward pass. for a token position t and a layer n, layer n cannot read layer n’s own output at earlier positions i < t, but layer n+1 can read everything that happened anywhere in the past, and that gathering step is what refines the meaning of the current token into a new vector. so you can’t have hidden state build up in the usual recurrent sense, and each token position runs a partially-shared algorithm. but you can have unhidden state build up, and that unhidden state gets you full turing completeness.
(the “brain area” equivalent would be “feature subspace”, afaik. that’s actually a slightly more general concept, since it also covers the case where a human brain lights up in ways that aren’t regionally localized.)
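a minimal single-head sketch of that gathering step, in numpy (the shapes and weights are made up for illustration; a real transformer adds multiple heads, residual connections, layer norm, and an MLP on top):

```python
import numpy as np

def causal_attention(x, W_q, W_k, W_v):
    """Single-head causal self-attention over layer-n outputs x[0..T-1]."""
    T, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)
    # causal mask: position t may only attend to positions i <= t, so
    # layer n+1 at position t reads everything layer n wrote in the past
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # one refined vector per position

rng = np.random.default_rng(0)
T, d = 5, 8                                   # toy sizes
x = rng.normal(size=(T, d))                   # layer n's outputs: the "unhidden state"
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
refined = causal_attention(x, W_q, W_k, W_v)  # what layer n+1's attention sees
```

nothing in x is hidden from later layers: it’s exactly what layer n+1’s attention reads, which is the sense in which the state is “unhidden”.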
Does this not mean the following, though?
In layer n, the feed-forward network at token position t will potentially waste compute re-deriving features that layer n already derived at earlier positions i < t, since it cannot see those earlier outputs.
This puts a constraint on the ability of different layers to represent different levels of abstraction, because now both layer n and layer n+1 need to be able to detect whether something “seems political”, rather than that job belonging to layer n alone.
This means the network needs to be deeper as the context gets longer, because token t has to wait until layer n+1 to see whether token t-1 had the feature “seems political”, token t+1 has to wait until layer n+2 to see whether token t had it, and so on (see the toy sketch below).
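Here is a toy simulation of that staircase, under the assumptions stated above: attention at layer n, position t can gather anything from layer n-1 at positions i <= t, and the feed-forward step can only derive the chained feature chain_t once chain_{t-1} has arrived in the stream. The fact names and sizes are made up purely for illustration:

```python
# toy model of information flow in a causal transformer:
# stream[layer][pos] is the set of "facts" in the residual stream after `layer`
T, L = 5, 8  # sequence length, layer count (made-up sizes)

stream = [[set() for _ in range(T)] for _ in range(L + 1)]
for t in range(T):
    stream[0][t].add(f"tok_{t}")  # layer 0: just the token embeddings
stream[0][0].add("chain_0")       # the base feature is free at position 0

for layer in range(1, L + 1):
    for t in range(T):
        # attention: gather all facts from positions i <= t, one layer down
        gathered = set().union(*stream[layer - 1][: t + 1])
        # feed-forward: derive chain_t only once chain_{t-1} is visible here
        if f"chain_{t - 1}" in gathered:
            gathered.add(f"chain_{t}")
        stream[layer][t] = stream[layer - 1][t] | gathered

for t in range(T):
    first = next(l for l in range(L + 1) if f"chain_{t}" in stream[l][t])
    print(f"chain_{t} first appears in position {t}'s stream at layer {first}")
# chain_0 -> layer 0, chain_1 -> layer 1, chain_2 -> layer 2, ...
```

The chained feature climbs exactly one layer per token position, which is the depth-vs-context constraint described above.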