Yeah, that’s an interesting point: belief state structures may be more complicated than the underlying states those beliefs represent. That’s difficult to square with my claim that all the information is present in the input, and that network layers can only destroy it or change its geometric embedding. Definitely something I want to look into and think about further.
TDA sounds cool. I’d like to take inspiration from it. Even if it isn’t useful as a tool as it stands, it may contain good ways to think about things, inspire tools that are useful, or at the very least give insight into things that have been tried and found not to be useful.
I mean, the info is still present in the input? It’s also not more complex than the represented state?
The thing that could’ve been true, but doesn’t seem to be, is that transformers only carry the information required to predict the final token, as opposed to the full Bayes-updated belief state. The advantage of the latter is that it’s what you need to optimally predict all future tokens.
In other words, if two belief states make the same predictions about what will happen right now, you might have expected that transformers wouldn’t keep track of the difference. In reality, they seem to.
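To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration: a three-state hidden Markov model with hypothetical `EMIT` and `TRANS` tables, and the convention (an assumption on my part) that a state emits a token first and then transitions. The two starting beliefs give identical predictions for the current token, but after a Bayes update they predict different things, which is the distinction at issue.

```python
# Toy hidden Markov model: 3 hidden states (A, B, C), 2 tokens (0 and 1).
# All numbers are hypothetical and chosen only to illustrate the point.
EMIT = [
    [0.5, 0.5],  # state A: P(token 0), P(token 1)
    [0.5, 0.5],  # state B: same emissions as A
    [0.9, 0.1],  # state C: strongly prefers token 0
]
TRANS = [
    [1.0, 0.0, 0.0],  # A stays in A
    [0.0, 0.0, 1.0],  # B moves to C
    [0.0, 0.0, 1.0],  # C stays in C
]


def next_token_probs(belief):
    """Distribution over the next token implied by a belief over hidden states."""
    n_tokens = len(EMIT[0])
    return [sum(b * EMIT[s][x] for s, b in enumerate(belief)) for x in range(n_tokens)]


def bayes_update(belief, token):
    """Condition the belief on an observed token, then apply the transition step."""
    weighted = [b * EMIT[s][token] for s, b in enumerate(belief)]
    total = sum(weighted)
    posterior = [w / total for w in weighted]
    n_states = len(TRANS)
    return [sum(posterior[s] * TRANS[s][t] for s in range(n_states)) for t in range(n_states)]


belief_a = [1.0, 0.0, 0.0]  # certain we are in state A
belief_b = [0.0, 1.0, 0.0]  # certain we are in state B

# Same prediction about what happens right now...
print(next_token_probs(belief_a), next_token_probs(belief_b))

# ...but different predictions one step later, after observing token 0.
print(next_token_probs(bayes_update(belief_a, 0)),
      next_token_probs(bayes_update(belief_b, 0)))
```

A model that only kept what it needs for the current token could merge `belief_a` and `belief_b`; a model tracking the full belief state has to keep them apart.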
Good luck with TDA! The book I thought looked good last time I considered it was Elementary Applied Topology by Robert Ghrist, but looking at it now, it seems to cover applications of topology more broadly. The other books I’ve come across are Computational Topology: An Introduction and Computational Topology for Data Analysis (the latter seems less accessible than the former).