My view (which I’m not certain about; it’s an empirical question) is that LLMs after pretraining have a bunch of modules inside them simulating various parts of the data-generating process, including low-level syntactic things and higher-level “persona” things. Then stuff kind of changes when you do RL on them.
If you do a little bit of RL, you’re kind of boosting some of the personas over others, making them easier to trigger, giving them more “weight” in the superposition of simulations, and also tuning the behavior of the individual simulations somewhat.
Then if you do enough RL-cooking of the model, stuff gets pulled more and more together, and in the limit either an agenty thing is formed from various pieces of the pre-existing simulations, or one simulation becomes big enough and eats the others.
What do you think about this view? Here simulatyness and agentyness are both important properties of the model, and they’re not in conflict: the system is primarily driven by simulation, but you end up with an agentic system.
So the next post in the sequence addresses some elements of what happens when you do RL after supervised learning.
The conclusion (though it’s similarly tentative to yours) is that RL changes the weights in the later layers of the network more, essentially giving us a system where an agent uses the output of a simulator.
I don’t think this is completely incompatible with your suggestion: that process requires simulatory circuits becoming agenty circuits (insofar as these exist; it seems entirely unclear to me whether these concepts make sense at the circuit level rather than being emergent phenomena). I think it would be great if we could do more research on how exactly this process takes place.
Have you guys done any experiments to check which is the case?
I did do a very quick, loose experiment to check the idea that later weights are the ones that tend to change at the end of training, while early weights tend to change early on. This was more of a sanity check than anything else, though.
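For anyone curious, a check like that can be done with something as simple as the sketch below (this is not my actual code; the checkpoint filenames and the "layers.N." parameter-naming convention are placeholder assumptions). The idea is just to load pairs of consecutive checkpoints and compare per-layer weight drift early in training versus late in training.

```python
# Minimal sketch of a per-layer weight-drift sanity check, assuming
# checkpoints are saved as plain PyTorch state dicts with parameter
# names like "layers.3.attn.weight" (both assumptions, not the real setup).
import re
import torch

def per_layer_drift(ckpt_a_path, ckpt_b_path):
    """Return {layer_index: summed L2 norm of weight change} between two checkpoints."""
    a = torch.load(ckpt_a_path, map_location="cpu")
    b = torch.load(ckpt_b_path, map_location="cpu")
    drift = {}
    for name, w_a in a.items():
        match = re.search(r"layers\.(\d+)\.", name)  # pull out the layer index
        if match is None or name not in b:
            continue
        layer = int(match.group(1))
        delta = (b[name].float() - w_a.float()).norm().item()
        drift[layer] = drift.get(layer, 0.0) + delta
    return drift

# Compare drift at the start vs. the end of training (filenames are hypothetical).
early = per_layer_drift("ckpt_step_1000.pt", "ckpt_step_2000.pt")
late = per_layer_drift("ckpt_step_9000.pt", "ckpt_step_10000.pt")
for layer in sorted(early):
    print(f"layer {layer:2d}  early drift {early[layer]:.3f}  late drift {late[layer]:.3f}")
```

If the hypothesis holds, the early-interval drift should be concentrated in low layer indices and the late-interval drift in high ones; it's crude (L2 drift ignores parameter count per layer), but fine as a sanity check.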