Yeah! I think distance, direction, and position (not topology) are at least locally important in semantic spaces, if not globally important, but continuity and connectedness (yes topology) are probably important for understanding the different semantic regions, especially since so much of what neural nets seem to do is warping the spaces in a way that wouldn’t change anything about them from a topological perspective!
subspaces of it where meaningful information might actually be stored
At least for vanilla networks, the input can be embedded into higher dimensions or projected into lower dimensions, so you’re only ever really throwing away information, which I think is an interesting perspective when thinking about the idea that meaningful information would be stored in different subspaces. It feels to me more like specific kinds of data points (inputs) which had specific locations in the input distribution would, if you projected their activation for some layer into some subspace, tell you something about that input. But whatever it tells you was already in the semantic topology of the input distribution; it just needed to be transformed geometrically before you could do a simple projection to a subspace to see it.
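To make that concrete with a toy sketch (everything here is made up for illustration: the random layer, the cluster locations, the choice of subspace): embed a 2-D input distribution into a higher-dimensional activation space with a fixed random linear layer, then project the activations onto a 1-D subspace. The separation you see in the projection was already present in the input geometry; the layer only transformed it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "vanilla network" layer: a fixed random linear embedding into 16-D.
W = rng.normal(size=(2, 16))

# Two clusters in the 2-D input distribution.
inputs_a = rng.normal([-2.0, 0.0], 0.3, size=(50, 2))
inputs_b = rng.normal([2.0, 0.0], 0.3, size=(50, 2))
acts = np.concatenate([inputs_a, inputs_b]) @ W  # activations for one "layer"

# Pick a 1-D subspace (difference of class means) and project onto it.
direction = acts[:50].mean(axis=0) - acts[50:].mean(axis=0)
direction /= np.linalg.norm(direction)
proj = acts @ direction

# The clusters separate cleanly in this subspace, but only because they
# were already separated in the input distribution before embedding.
print(proj[:50].mean() > 0, proj[50:].mean() < 0)
```

Since the layer here is linear, the projection is literally a linear function of the input, so nothing new was created; a real network's nonlinearities warp the geometry, but the information still had to be in the input to begin with.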
Hmm, I’m reminded of the computational mechanics work with their flashy paper finding that transformers’ residual stream represents the geometry of belief state updates (as opposed to, say, just the next token), as found by experimentally finding a predicted fractal in a simple carefully chosen prediction problem. Now, there’s more going on than topology there, and I don’t know if they looked at the topology—but fractals do have interesting topological properties, in case that’s helpful.
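As a rough sketch of what the belief-state picture means (using a random toy HMM of my own, not the carefully chosen processes from that paper, and one common convention for the filter update): run the Bayesian filter over emission sequences and collect the visited belief states, which are points in the probability simplex. It's the geometry of this set of points that the residual stream was found to represent linearly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary random 3-state HMM over a binary alphabet (illustration only).
T = rng.dirichlet(np.ones(3), size=3)  # T[i, j] = P(next state j | state i)
E = rng.dirichlet(np.ones(2), size=3)  # E[i, o] = P(emit symbol o | state i)

def update(belief, obs):
    """One Bayesian belief-state update: transition, then condition on obs."""
    b = (belief @ T) * E[:, obs]
    return b / b.sum()

# Run the filter over random symbol sequences, collecting belief states.
beliefs = []
for _ in range(200):
    b = np.ones(3) / 3  # start from the uniform prior
    for obs in rng.integers(0, 2, size=8):
        b = update(b, obs)
        beliefs.append(b.copy())
beliefs = np.array(beliefs)

# Every visited belief is a point on the 2-simplex.
print(beliefs.shape, np.allclose(beliefs.sum(axis=1), 1.0))
```

For a generic random HMM this point cloud is unremarkable; the paper's trick was picking processes where the visited belief states provably form a fractal, so finding that fractal in the residual stream is strong evidence the model tracks beliefs, not just next-token statistics.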
I also wonder if there’s a connection to topological data analysis, which is built around persistent homology. Now, the vibe I tend to get is that basically nobody actually uses TDA in actual practice, even if you can technically ‘apply’ it. But maybe you can find it useful anyways; or maybe I’m just wrong about how much TDA is used in practice.
Yeah, that’s interesting to point out, that belief state structures may be more complicated than the underlying state those beliefs represent. That’s difficult to square with my claim that all the information is present in the input, and that network layers can only destroy or change the geometric embedding of the information. Definitely something I want to look into and think about further.
TDA sounds cool. I’d like to take inspiration from it. Even if it isn’t useful as-is, it may contain good ways to think about things, inspire tools that are useful, or at the very least give insight into what’s been tried and found not to work.
I mean, the info is still present in the input? It’s also not more complex than the represented state?
The thing that could’ve been true but doesn’t seem to be is that transformers might only carry the information required to predict the next token. This is in contrast with the full Bayes-updated belief state. The advantage of the latter is that it’s what you need to optimally predict all future tokens.
In other words, if two belief states make the same predictions about what will happen right now, you could’ve thought that transformers wouldn’t be keeping track of the difference. In reality, they seem to.
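A toy illustration of that distinction (my own made-up 4-state HMM, not from the paper): two beliefs that agree exactly on the next symbol's distribution, but disagree about what comes after. A model that only tracked next-token predictions could collapse them; a model tracking the full belief state has to keep them apart.

```python
import numpy as np

# States A and B both emit 'x' with certainty, but A leads to C (emits 'y')
# while B leads to D (emits 'z'). Deterministic for clarity.
T = np.array([  # transition matrix, row = current state, order A, B, C, D
    [0, 0, 1, 0],  # A -> C
    [0, 0, 0, 1],  # B -> D
    [1, 0, 0, 0],  # C -> A
    [0, 1, 0, 0],  # D -> B
], dtype=float)
E = np.array([  # emission probabilities, columns = symbols x, y, z
    [1, 0, 0],  # A emits x
    [1, 0, 0],  # B emits x
    [0, 1, 0],  # C emits y
    [0, 0, 1],  # D emits z
], dtype=float)

def next_symbol_dist(belief):
    """P(symbol emitted now) under a belief over hidden states."""
    return belief @ E

def symbol_after_dist(belief):
    """P(symbol emitted at the following step), after one transition."""
    return (belief @ T) @ E

b_A = np.array([1.0, 0.0, 0.0, 0.0])  # certain we're in A
b_B = np.array([0.0, 1.0, 0.0, 0.0])  # certain we're in B

print(next_symbol_dist(b_A), next_symbol_dist(b_B))    # identical: both say 'x'
print(symbol_after_dist(b_A), symbol_after_dist(b_B))  # differ: 'y' vs 'z'
```

Collapsing b_A and b_B would be lossless for the next token and lossy for everything after, which is exactly the distinction the belief-state-geometry result turns on.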
Good luck with TDA! The book I had thought looked good last I considered it was Elementary Applied Topology by Robert Ghrist, but looking at it now it seems to cover applications of topology more broadly. The other books I saw once were Computational Topology: An Introduction and Computational Topology for Data Analysis (the latter seems less accessible than the former).