Hmm, I’m reminded of the computational mechanics work, with their flashy paper finding that transformers’ residual stream represents the geometry of belief state updates (as opposed to, say, just the next token), demonstrated by experimentally finding a predicted fractal in a simple, carefully chosen prediction problem. Now, there’s more going on than topology there, and I don’t know whether they looked at the topology; but fractals do have interesting topological properties, in case that’s helpful.
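For concreteness, the object that paper studies is the trajectory of Bayesian belief updates over a hidden Markov model. Here's a minimal stdlib-only sketch with a made-up 3-state HMM (not the specific process from the paper); the point is just the update recursion, whose trajectory traces out the belief-state geometry in question:

```python
import random

# A made-up 3-state HMM for illustration: T[i][j] is the probability of
# moving from hidden state i to j; E[i][k] is the probability of state i
# emitting symbol k.
T = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
E = [[0.9, 0.1],
     [0.5, 0.5],
     [0.1, 0.9]]

def update(belief, symbol):
    """One Bayesian belief update: propagate through T, then condition on the symbol."""
    predicted = [sum(belief[i] * T[i][j] for i in range(3)) for j in range(3)]
    unnorm = [predicted[j] * E[j][symbol] for j in range(3)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Trace the belief state along a random symbol sequence. The paper's claim
# is that a point set like this (a fractal, for their carefully chosen
# process) shows up linearly embedded in the residual stream.
random.seed(0)
belief = [1/3, 1/3, 1/3]
trajectory = [belief]
for _ in range(200):
    belief = update(belief, random.choice([0, 1]))
    trajectory.append(belief)
```

Every point stays on the probability simplex; the interesting structure is in how the update map folds that simplex into itself.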
I also wonder if there’s a connection to topological data analysis (TDA), which studies data through persistent homology. Now, the vibe I tend to get is that basically nobody uses TDA in actual practice, even if you can technically ‘apply’ it. But maybe you can find it useful anyway; or maybe I’m just wrong about how much TDA gets used.
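The simplest TDA-style invariant is the number of connected components of a point cloud at a given distance scale (the 0-th Betti number); persistent homology tracks how invariants like this appear and disappear as the scale varies. A stdlib-only sketch of just that zeroth-order piece (real TDA libraries such as GUDHI or Ripser compute much more):

```python
import math

def betti0(points, scale):
    """Count connected components of the graph linking points closer than
    `scale` -- the 0-th Betti number of the point cloud at that scale.
    Uses a small union-find over all pairs."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < scale:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Two well-separated clusters: two components at a small scale,
# merging into one at a large scale.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(betti0(cloud, 0.5))   # 2
print(betti0(cloud, 10.0))  # 1
```

"Persistence" is the observation that the two-cluster feature survives across a wide range of scales, which is what distinguishes real structure from noise.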
Yeah, that’s an interesting point, that belief state structures may be more complicated than the underlying state those beliefs represent. That’s difficult to square with my claim that all the information is present in the input, and that network layers can only destroy information or change its geometric embedding. Definitely something I want to look into and think about further.
TDA sounds cool. I’d like to take inspiration from it: even if it isn’t useful as-is, it may contain good ways to think about things, inspire tools that are useful, or at the very least give insight into what’s been tried and found not to work.
I mean, the info is still present in the input? And it’s not more complex than the represented state?
The thing that could’ve been true, but doesn’t seem to be, is that transformers only carry the information required to predict the next token. This is in contrast with the full Bayes-updated belief state, whose advantage is that it’s what you need to optimally predict all future tokens.
In other words, if two belief states make the same predictions about what will happen right now, you might have thought transformers wouldn’t bother keeping track of the difference. In reality, they seem to.
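That distinction can be made concrete with a toy HMM (made up for illustration): two belief states that induce the exact same next-symbol distribution, but diverge one step later. A next-token-only predictor could merge them; a Bayes-optimal predictor of the whole future cannot.

```python
# Hypothetical 3-state HMM: E[i][k] = P(emit symbol k | state i),
# T[i][j] = P(next state j | state i) -- a deterministic 3-cycle here.
E = [[0.5, 0.5],
     [0.5, 0.5],
     [0.9, 0.1]]
T = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

def matvec(b, M):
    """Left-multiply a belief row-vector b by matrix M."""
    return [sum(b[i] * M[i][j] for i in range(3)) for j in range(len(M[0]))]

b1 = [0.6, 0.2, 0.2]  # two different belief states over the hidden state
b2 = [0.4, 0.4, 0.2]

# Same distribution over the next symbol...
print([round(p, 2) for p in matvec(b1, E)])  # [0.58, 0.42]
print([round(p, 2) for p in matvec(b2, E)])  # [0.58, 0.42]

# ...but different distributions over the symbol after that,
# because the beliefs propagate differently through T.
print([round(p, 2) for p in matvec(matvec(b1, T), E)])  # [0.58, 0.42]
print([round(p, 2) for p in matvec(matvec(b2, T), E)])  # [0.66, 0.34]
```

The trick is that b1 - b2 lies in the kernel of E (states 0 and 1 emit identically) but not in the kernel of T·E, so the difference is invisible now and visible later.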
Good luck with TDA! The book I thought looked good last time I considered this was Elementary Applied Topology by Robert Ghrist, though looking at it now it seems to cover applications of topology more broadly. The other books I’ve seen are Computational Topology: An Introduction, and Computational Topology for Data Analysis (which seems less accessible than the previous one).