Now this is truly interesting. As other commentators, and the article itself, have pointed out, it’s very difficult to distinguish true introspection (the model tracking and reasoning effectively about its own internal activations) from more mundane explanations (activating internal neurons related to bread makes a bread-related output more likely, and leaves the model less certain later on that the original text wasn’t somehow related to bread until it tries to reason through the connection manually).
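To make that mundane mechanism concrete, here’s a toy sketch, nothing to do with the article’s actual model or code, all names and numbers made up: adding a concept direction to an intermediate activation shifts the downstream logits without any introspection being involved.

```python
# Toy illustration of the "mundane" reading: injecting a concept direction into an
# intermediate activation biases downstream outputs, no introspection required.
# Everything here is hypothetical and schematic.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab = 16, 8
block = nn.Linear(d_model, d_model)      # stand-in for a transformer layer
unembed = nn.Linear(d_model, vocab)      # stand-in for the output head
bread_direction = torch.randn(d_model)   # pretend this is a learned "bread" feature
bread_direction /= bread_direction.norm()

def add_bread(module, inputs, output):
    # Forward hook that injects the concept into the layer's output.
    return output + 6.0 * bread_direction

x = torch.randn(1, d_model)
baseline_logits = unembed(block(x))
handle = block.register_forward_hook(add_bread)
injected_logits = unembed(block(x))
handle.remove()

# The injection shifts the logits even though the input never changed; any later
# "was this text about bread?" judgment inherits that shift for free.
print((injected_logits - baseline_logits).abs().max())
```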
I remember a paper on world-modeling where the authors showed pretty concretely that their LLM had a working world model: they trained a probe to predict the future world state of a gridworld robot-programming environment from the model’s last-layer activations after each command, and then showed that a counterfactual future world state, in which rotation commands had reversed meanings, could not be similarly predicted. I’m not entirely sure how I would apply the same approach here, but it seems broadly pertinent in that introspection is likewise a very difficult-to-nail-down capability. Unfortunate that “train a probe to predict the model’s internal state given the model’s internal state” is a nonstarter.
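For anyone who hasn’t seen that kind of probing setup, a minimal sketch of what I mean, assuming you already have the activations and world-state labels in hand (variable names are made up, this isn’t the paper’s code):

```python
# Hypothetical sketch of the probing setup described above, not the paper's code.
# Assumes last-layer activations per command and the corresponding future world
# states (real and counterfactual), encoded as class ids, are already collected.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations, labels):
    """Train a linear probe on activations -> labels and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# acts:           (n_commands, d_model) last-layer activations after each command
# real_states:    future world state under the real command semantics
# counterfactual: future world state if rotation commands had reversed meanings
# The world-model argument rests on the first accuracy being high and the second near chance.
# print(probe_accuracy(acts, real_states))
# print(probe_accuracy(acts, counterfactual))
```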
In this experiment, the most interesting phenomenon to explain is not how the model correctly identifies the injected concept, but rather how it correctly notices that there is an injected concept in the first place.
This was my takeaway too. The explanation of an LLM learning to model its own future activations and using the accuracy of that model to estimate the entropy of a body of text seems compelling. I wonder where the latent “anomaly” activations would show up if you compared the activations for true positives in this experiment against those for the control group, and whether they’d be related to the activation patterns associated with any other known concepts.
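Something like a difference-of-means comparison is what I have in mind, e.g. (purely hypothetical variable names, and assuming you’ve collected activations for both groups at the same layer):

```python
# Rough sketch of the comparison described above, with made-up variable names.
# Assumes residual-stream activations at some layer for (a) trials where the model
# correctly flagged an injected concept and (b) matched control trials.
import numpy as np

def anomaly_direction(true_positive_acts, control_acts):
    """Difference-of-means direction separating 'noticed an injection' from control."""
    direction = true_positive_acts.mean(axis=0) - control_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# known_concepts: dict mapping concept name -> unit vector for an already-identified feature.
# High cosine similarity would suggest the "anomaly" signal reuses an existing concept
# rather than being a dedicated self-monitoring feature.
# direction = anomaly_direction(tp_acts, ctrl_acts)
# print({name: cosine(direction, vec) for name, vec in known_concepts.items()})
```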
Is this the paper you’re talking about: https://arxiv.org/abs/2309.00941
Nope—that’s not the one. Did a bit of searching, and it’s here.
Thanks!