I sadly do not have the capacity to learn about the majority of it.
Sadly, it’s a problem you share with me and most humans, I think, with possible rare exceptions like Paul Erdős.
I’ll try to build up a quick sketch of what the residual stream is, forgive me if I say things that are basic, obtuse, or slightly wrong for brevity.
All neural networks (NNs) are built from linear transformations/maps, which in NN jargon are called “weights”, and non-linear maps called “activation functions”. The outputs of the activation functions are called “activations”. There are also special kinds of maps and operations depending on the “architecture” of the NN (e.g. ConvNet, ResNet, LSTM, Transformer).
A vanilla NN is just a series of “layers” consisting of a linear map and then an activation function.
The activation functions are not complicated nonlinear maps, but quite simple to understand. One of the most common, ReLU, can be understood as “for all vectors, leave positive components alone, set negative components to 0”, or “project all negative orthants onto the 0 hyperplane”. Since most of the complex behaviour of NNs comes from the interplay of the linear maps and these simple nonlinear maps, linear algebra is a very foundational tool for understanding them.
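As a concrete sketch of the last two paragraphs (in NumPy, with made-up shapes and random weights), a single vanilla layer is just a linear map followed by the componentwise ReLU:

```python
import numpy as np

def relu(x):
    # "Leave positive components alone, set negative components to 0."
    return np.maximum(x, 0.0)

def vanilla_layer(x, W, b):
    # Linear map ("weights" plus a bias) followed by the nonlinearity.
    return relu(W @ x + b)

v = np.array([2.0, -3.0, 0.5, -1.0])
print(relu(v))  # [2.  0.  0.5 0. ]

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input vector
W = rng.normal(size=(3, 4))     # weights: maps R^4 -> R^3
b = np.zeros(3)                 # bias
print(vanilla_layer(x, W, b))   # every component is >= 0, by construction
```

A full vanilla network is then just this layer composed with itself several times (with different `W` and `b` each time).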
The transformer architecture is the fanciest new architecture that forms the foundation of modern LLMs, which act as the “general pretrained network” for products such as ChatGPT. The architecture is a series of “transformer blocks”, each of which has a stack of “attention heads”, which are still matrix transformations but set up in a special way, followed by a vanilla NN.
The output of each transformer block is summed with its input to form the input for the next transformer block. The input is called a “residual”, borrowing the terminology of ResNets. So the transformer block can be thought of as “reading from” and “writing to” a “stream” of residuals passed along from one transformer block to the next, like widgets on a conveyor belt, each worker doing their one operation and then letting the widget pass to the next worker.
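A minimal sketch of that read/write picture, with the block internals elided (here `transformer_block` is just a stand-in for attention plus MLP; the only thing that matters for the residual stream is that its output has the same shape as its input):

```python
import numpy as np

def transformer_block(residual, W):
    # Stand-in for a real block (attention heads + vanilla NN):
    # any map whose output has the same shape as its input,
    # i.e. "what this block writes to the stream".
    return np.tanh(W @ residual)

rng = np.random.default_rng(0)
d_model = 8   # dimension of the residual stream (made-up)
blocks = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]

residual = rng.normal(size=d_model)   # the "widget" on the conveyor belt
for W in blocks:
    # Each block reads the current stream and adds its write back in.
    residual = residual + transformer_block(residual, W)
```

The `residual = residual + ...` line is the whole trick: every block sees the accumulated writes of all the blocks before it.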
For a language model, the input to the first transformer block is a sequence of token embeddings representing some sequence of natural language text. The output of the last transformer block is a sequence of predictions for what the next token will be, based on the previous ones. So I imagine the residual stream as a high-dimensional semantic space, with each transformer block applying linear transformations and limited nonlinear transformations to that space to take the semantics from “sequence of words” to “likely next word”.
I am interested in understanding those semantic spaces and think linear algebra, topology, and manifolds are probably good perspectives.
Thanks for your clear explanation, understanding the topology of the space seems fascinating. If it’s a vector space, I would assume its topology is simple, but I can see why you would be interested in the subspaces of it where meaningful information might actually be stored. I imagine that since topology is the most abstract form of geometry, the topological structure would represent some of the most abstract and general ideas the neural network thinks about.
Yeah! I think distance, direction, and position (not topology) are at least locally important in semantic spaces, if not globally important, but continuity and connectedness (yes topology) are probably important for understanding the different semantic regions, especially since so much of what neural nets seem to do is warping the spaces in a way that wouldn’t change anything about them from a topological perspective!
“subspaces of it where meaningful information might actually be stored”
At least for vanilla networks, the input can be embedded into higher dimensions or projected into lower dimensions, so you’re only ever really throwing away information, which I think is an interesting perspective when thinking about the idea that meaningful information would be stored in different subspaces. It feels to me more like specific kinds of data points (inputs) which had specific locations in the input distribution would, if you projected their activations for some layer into some subspace, tell you something about that input. But whatever it tells you was already in the semantic topology of the input distribution; it just needed to be transformed geometrically before you could do a simple projection to a subspace to see it.
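The “only ever throwing away information” point can be made concrete with a toy linear-algebra example (made-up matrices; the asymmetry is just rank): projecting down can collapse distinct inputs, while embedding up is always invertible from the left.

```python
import numpy as np

# Projection to a lower dimension can only lose information:
# distinct inputs can collapse to the same activation.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])       # R^3 -> R^2, drops the third component

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, -7.0])        # differs only in the dropped component
assert np.allclose(P @ a, P @ b)      # now indistinguishable

# Embedding into a higher dimension loses nothing: a full-column-rank
# map has a left inverse, so the original input is exactly recoverable.
E = np.vstack([np.eye(3), np.ones((2, 3))])   # R^3 -> R^5
E_left_inv = np.linalg.pinv(E)
assert np.allclose(E_left_inv @ (E @ a), a)
```

The nonlinearities complicate this picture, of course (ReLU also collapses points), but the linear part of each layer really does split cleanly into “reversible” and “lossy” directions.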
Hmm, I’m reminded of the computational mechanics work with their flashy paper finding that transformers’ residual stream represents the geometry of belief state updates (as opposed to, say, just the next token), as found by experimentally finding a predicted fractal in a simple carefully chosen prediction problem. Now, there’s more going on than topology there, and I don’t know if they looked at the topology—but fractals do have interesting topological properties, in case that’s helpful.
I also wonder if there’s a connection to topological data analysis, which looks at some sort of homology. Now, the vibe I tend to get is that basically nobody actually uses TDA in actual practice, even if you can technically ‘apply’ it. But maybe you can find it useful anyways; or maybe I’m just wrong about how much TDA is used in practice.
Yeah, that’s interesting to point out, that belief state structures may be more complicated than the underlying state those beliefs represent. That’s difficult to square with my claim that all the information is present in the input, and that network layers can only destroy or change the geometric embedding of the information. Definitely something I want to look into and think about further.
TDA sounds cool. I’d like to take inspiration from it, even if it isn’t a tool that is useful as it is, it may contain good ways to think about things, inspire tools that are useful, or at the very least give insight into things that have been tried and found to not be useful.
I mean, the info is still present in the input? It’s also not more complex than the represented state?
The thing that could’ve been true but doesn’t seem to be, is that transformers might only carry the information required to predict the final token. This is in contrast with the full Bayes-updated belief state. The advantage of the second is that it’s what you need to optimally predict all future tokens.
In other words, if two belief states make the same predictions about what will happen right now, you could’ve thought that transformers wouldn’t be keeping track of the difference. In reality, they seem to.
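That distinction can be shown with a toy hidden Markov model (entirely made-up numbers; `T` is the hidden-state transition matrix, `O` the emission matrix): two belief states that predict the same next symbol, but different symbols further out, so tracking only the next-symbol distribution would wrongly merge them.

```python
import numpy as np

# Toy HMM: 3 hidden states, 2 output symbols.
T = np.array([[1.0, 0.0, 0.0],    # transitions between hidden states
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
O = np.array([[0.5, 0.5],         # symbol probabilities per hidden state
              [0.5, 0.5],
              [0.9, 0.1]])

b1 = np.array([1.0, 0.0, 0.0])    # belief: definitely in state 0
b2 = np.array([0.0, 1.0, 0.0])    # belief: definitely in state 1

# Both beliefs predict the *same* distribution over the very next symbol...
assert np.allclose(b1 @ O, b2 @ O)

# ...but different distributions one step further out (state 1 drifts into
# state 2, which emits differently), so a predictor that only carried the
# next-symbol distribution would have merged two states it needed apart.
assert not np.allclose((b1 @ T) @ O, (b2 @ T) @ O)
```

The computational mechanics result is, roughly, that transformers appear to keep `b1` and `b2` apart in the residual stream, as a full belief state would, rather than collapsing them.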
Good luck with TDA! The book I had thought looked good last I considered it was Elementary Applied Topology by Robert Ghrist, but looking at it now it seems to cover applications of topology more broadly. The other books I saw were Computational Topology: An Introduction and Computational Topology for Data Analysis (which seems less accessible than the former).