This is a self-review. It’s been about 600 days since this was posted, and I’m still happy with and proud of this post. In terms of what I view as the important message to the readership, the main thing is that it introduces a framework and way of thinking that connects the pretty fuzzy notion of a “world model” to the concrete internal structure of neural networks. It does this in a way that is both theoretically clear and amenable to experiments. It provides a general way to think about representations in transformers, one that is quite different from the usual approach taken in interpretability. One interesting thing, given how interp has progressed since then, is that it provides a way to think about geometric structures in the activations of these networks, which is something the field seems to be moving towards (albeit not guided by theory the way this work is).
I spent a lot of time and effort making the presentation easy to follow while still containing the important lessons, and I feel this post does a great job at that. (In fact, others around me were telling me I was taking too long and that I just needed to hit submit. I’m glad I didn’t follow that advice :)
This work has been expanded on in a number of directions (a recent post summarizes three of those directions). We now have Simplex, an entire org that would likely not exist in the way that it does without this post (we got seed funding right before this post, but the ability to fundraise and attract talent was highly dependent on this post)! We are ~10 people and expanding. Starting this org, watching it grow, and being part of building up a new way to think about and do interpretability has been the greatest intellectual achievement of my life, and again, I do not think this would have occurred (at least not the way it did) without this post.
One of the things we are working on is extending the framework to more complicated (factorizable) generative structures, which gives rise both to a very nice story explaining how neural nets can cram so much into their activations, and to the corresponding geometric relationship to computational structure. We also have a team dedicated to red-teaming the theory and its application to experiments in a number of ways, one of which is more precisely testing the distinction between next-token and far-future representations. We are also looking for the types of geometric structures our theory predicts in LLMs, and building tools for unsupervised discovery of these geometric structures in LLMs.
In retrospect, there are a few things I would have emphasized or done differently. The most concrete one would have been to include experiments on RRXOR in this writeup, since they more directly test and show representations that carry information beyond the next token, as predicted by the theory.
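For readers who haven't met RRXOR, here is a minimal sketch of the data-generating process, assuming the usual definition of Random-Random-XOR: two fair random bits followed by their XOR, repeated. The function name and parameters are hypothetical and chosen for illustration; this is not the code used in any of the experiments.

```python
import random


def rrxor_sequence(n_tokens, seed=0):
    """Sample n_tokens from an RRXOR-style process (illustrative sketch).

    Assumed definition: emit two independent fair bits a and b,
    then emit a XOR b, and repeat.
    """
    rng = random.Random(seed)
    out = []
    while len(out) < n_tokens:
        a = rng.randint(0, 1)  # first random bit
        b = rng.randint(0, 1)  # second random bit
        out.extend([a, b, a ^ b])  # third token is fully determined
    return out[:n_tokens]


seq = rrxor_sequence(30)
print(seq)
```

In practice an observer sees the stream without knowing where each triple starts, so predicting well requires tracking a belief over the phase as well as the recent bits. That is the sense in which the process calls for information beyond what the next token alone requires.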
Another regret is that I haven’t done a good job of keeping up with the public-facing side of our work since this post. We have made only one post about our work since then, and it leaves out a lot of the insights we’ve had internally. This is not great, since I do believe that we have refined our way of thinking about representations and computations in these systems in a way that the interp community needs.
More personally (and perhaps this isn’t relevant for a review since it really is just about me), this post came after I switched fields from neuroscience, where I had been for a decade, to interpretability. There was a lot of uncertainty and stress about that move, and quite a bit of wondering if I could actually contribute. So it was a big deal for me to see the community react positively to this work. Additionally, I had been interested in computational mechanics for a decade from within neuroscience, since it felt like it could say something about computation in biological neural nets, but I could never figure out how to get it to do anything useful. So it’s also a privilege to see the dream play out and be realized in neural nets, and to start a research org dedicated to that vision!
The research program started here is in a real sense a bet, in that it could fail to be useful at scale for a few different reasons. But the fact that we are building it up with both theory and experiment, in a way where each informs the other, and that we were able to show the community a small part of what this looks like, is a lesson I hope the readership takes away, even beyond the specific content of this post. (Hot take) I think a lot of great work happens within this community, but too often the more theoretical and the more empirical approaches don’t talk to each other as much as they should.