I wonder if you could get anything interesting by first training the activation oracle to predict a target model’s next token from its previous tokens (to learn its personality and capabilities), and then training it to predict the model’s activations from the contents of the context window (to translate that grasp of the model’s behavior into a grasp of how the circuits work). After that, you could train it to explain activations in natural language, as you do now. The first two stages might help it learn latent structure in the model’s activations, which could then transfer into natural-language outputs in the final stage.
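To make the three stages concrete, here is a rough sketch of how one recorded forward pass could be turned into a training example for each stage. Everything in it is hypothetical: `Trace`, `Stage`, `CURRICULUM`, and `build_examples` are made-up names, and it only shows what the (input, target) pairs would look like, not how the oracle itself would be trained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trace:
    """One recorded forward pass of the target model (hypothetical format)."""
    tokens: List[str]         # context window followed by the model's next token
    activations: List[float]  # activations at some chosen layer and position
    explanation: str          # natural-language label for those activations


@dataclass
class Stage:
    name: str
    make_example: Callable[[Trace], Tuple]  # maps a trace to (input, target)


# Assumed ordering: behavior first, then activations, then explanations.
CURRICULUM = [
    Stage("next_token", lambda t: (t.tokens[:-1], t.tokens[-1])),
    Stage("activations", lambda t: (t.tokens[:-1], t.activations)),
    Stage("explanation", lambda t: (t.activations, t.explanation)),
]


def build_examples(traces: List[Trace]) -> Dict[str, List[Tuple]]:
    """Build one list of (input, target) pairs per stage, in curriculum order."""
    return {
        stage.name: [stage.make_example(t) for t in traces]
        for stage in CURRICULUM
    }


trace = Trace(["The", "cat", "sat"], [0.1, -0.3], "subject is an animal")
examples = build_examples([trace])
```

The thing to notice is that stages one and two share the same input (the context), while stages two and three share the activations, which is where any transfer between stages would have to come from.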
Idk. The real grail here would be training an oracle to tamper with a model’s activations, make predictions about how this will affect the model’s behavior, and learn from feedback, à la the standard scientific method. I kind of expect that would be computationally intractable, though, since the space of ways you can tamper with activations, even just within one layer, is absolutely massive...
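A back-of-envelope count of how big that space gets, under some very crude assumptions of my own: the oracle edits exactly `n_edited` coordinates of a `d_model`-wide activation vector, and each edited coordinate is set to one of `n_values` discrete values. The function name and the example width of 4096 are illustrative, not taken from any particular model. Real interventions are continuous, so this undercounts.

```python
import math


def count_interventions(d_model: int, n_edited: int, n_values: int) -> int:
    """Number of distinct edits: choose which coordinates, then a value for each."""
    return math.comb(d_model, n_edited) * n_values ** n_edited


# Tiny case, easy to check by hand: C(4, 2) * 3**2 = 6 * 9 = 54.
small = count_interventions(d_model=4, n_edited=2, n_values=3)

# Hypothetical layer of width 4096, editing 8 coordinates to one of 2 values each.
large = count_interventions(d_model=4096, n_edited=8, n_values=2)
print(f"about 10^{math.log10(large):.0f} possible interventions")
```

Even with only eight coordinates and two values each, the count is far beyond anything you could enumerate, so an oracle would need some prior over which edits are worth trying.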