So the next post in the sequence addresses some elements of what happens when you do RL after supervised learning.
The conclusion (though it's similarly tentative to yours) is that RL will change the weights later in the network more, essentially giving us a system in which an agent uses the output of a simulator.
I don't think this is completely incompatible with your suggestion: this process requires simulatory circuits becoming agenty circuits (insofar as such circuits exist; it's entirely unclear to me whether these concepts make sense at the circuit level or whether they're emergent phenomena). I think it would be great if we could do more research on how exactly this process takes place.
I did do a very quick, loose experiment to check the idea that later weights tend to change at the end of training, while early weights tend to change early on. This was more of a sanity check than anything else, though.
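For what it's worth, here's roughly the kind of sanity check I have in mind (a minimal sketch, not my actual experiment): train a toy MLP and record how much each layer's weight matrix moves per epoch, relative to its norm, then eyeball whether early layers settle down sooner than later ones. The model, data, and hyperparameters here are all arbitrary placeholders.

```python
# Minimal sketch (not the original experiment): track how much each layer's
# weights move per epoch, to see whether later layers keep changing late in training.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data
X = torch.randn(1024, 32)
y = torch.randn(1024, 1)

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def weight_snapshot(m):
    # Copy only the weight matrices (skip biases) so we can measure per-layer movement.
    return {name: p.detach().clone() for name, p in m.named_parameters() if p.dim() > 1}

for epoch in range(20):
    before = weight_snapshot(model)
    for i in range(0, len(X), 64):
        opt.zero_grad()
        loss = loss_fn(model(X[i:i+64]), y[i:i+64])
        loss.backward()
        opt.step()
    # Relative change of each layer's weights over this epoch.
    deltas = {name: (p.detach() - before[name]).norm().item() / before[name].norm().item()
              for name, p in model.named_parameters() if p.dim() > 1}
    print(f"epoch {epoch:2d}  " + "  ".join(f"{n}: {d:.4f}" for n, d in deltas.items()))
```

If the early layers' relative movement shrinks faster over epochs than the later layers', that's (weak) evidence for the pattern I described, though obviously a toy supervised setup like this says nothing directly about the RL-after-pretraining case.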