This explanation implies that circuits, or “structures”, are pretty much all there is in the mechanistic picture of Transformers, and that concepts, as inputs to downstream circuits, are “activated” probabilistically between 0 and 1.
But this picture seems impoverished: I don’t think Transformers can be reduced to a network of circuits. There are perhaps also non-trivial concept representations within the activations, on which MLP layers perform non-trivial algorithmic computations. These representations, and the computations over them, are probably easier to shift during RLHF than the circuit topology is.
And there are probably yet more phenomena and dynamics in Transformers that cannot be reduced to the things mentioned above, but I’m out of my depth to discuss them.
Also, if feedback is applied throughout pre-training, it must influence the very structures formed within the Transformer.
Perhaps I used the wrong term; by “activating” I did not mean just on/off (with “more” taken to imply probability?). I mainly meant more weight, though on/off could also be involved. Sorry, I am not necessarily familiar with the technical terms used.
I am also thinking of “structures” as a more general concept than just circuits, and not necessarily isolated within the system. I think of a “structure” as a pattern within the system which achieves one or more functions. By a “higher-level” structure I mean a structure made of other structures.
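As an aside, here is a toy sketch of the “more weight” intuition. This is my own illustration under a linear-representation assumption (a concept as a direction in activation space), not something from the post; all names here are hypothetical.

```python
import numpy as np

# Toy model (illustrative assumption): a "concept" is a direction in
# activation space, and its activation is a continuous magnitude --
# a weight -- rather than a binary on/off switch.
rng = np.random.default_rng(0)
d_model = 16

concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)  # unit vector

def concept_activation(residual_stream: np.ndarray) -> float:
    """Continuous 'weight' of the concept: projection onto its direction."""
    return float(residual_stream @ concept_direction)

# Two hypothetical activation vectors: the concept present weakly vs. strongly.
weak = 0.3 * concept_direction + 0.01 * rng.normal(size=d_model)
strong = 2.0 * concept_direction + 0.01 * rng.normal(size=d_model)

print(concept_activation(weak))    # small positive weight
print(concept_activation(strong))  # larger weight: "more", not just "on"
```

On this reading, fine-tuning could shift how strongly a concept is weighted in a given context without rewiring any circuit.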
> Also, if feedback is applied throughout pre-training, it must influence the very structures formed within the Transformer.
Yes, in the post I was only considering the case where fine-tuning is applied afterwards. Feedback applied during pre-training is a different matter.