Downvoted because this post aims to persuade, not explain.
I also disagree that it’s a good move to separate the behaviour of a real-world LLM trained by one of the leading labs (e.g., GPT-4, Claude) into a pure simulator (That-Which-Predicts, in the terms of this post) and something else (though not exactly “masks”, aka simulacra, because those are instantiated with concrete context, while we are talking about the behaviour of a fine-tuned LLM across all contexts!). I think this separation will lead to more confusion and deranged predictions than clarity and correct predictions.
IMO, it’s better to think about an LLM as a monolithic complex adaptive system with multiple dynamic tendencies, or “forces” (namely, a “next-token prediction” force and a “fine-tuning bias” force), which give rise to a non-trivial potential landscape (and, hence, local minima/attractors) in the context/behaviour space.
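To make that landscape picture concrete, here is a toy sketch (purely illustrative: the 1-D behaviour coordinate, both potential terms, and all coefficients are invented, not derived from any real model) of two “forces” summing into a potential with several local minima, i.e., attractors:

```python
import numpy as np

# 1-D stand-in for "context/behaviour space" (invented for illustration).
x = np.linspace(-3.0, 3.0, 601)

# Two "forces": a many-welled next-token-prediction term and a gentler
# fine-tuning bias term pulling behaviour toward x = 1.
U_prediction = np.cos(3.0 * x)
U_finetune = 0.5 * (x - 1.0) ** 2

# Combined potential landscape; its local minima are the attractors.
U = U_prediction + U_finetune
is_min = (U[1:-1] < U[:-2]) & (U[1:-1] < U[2:])
print(x[1:-1][is_min])  # several attractors, shifted toward x = 1
```

The point is only that superposing a many-welled prediction term with a gentler bias term leaves several attractors intact while changing which ones are deepest, rather than replacing the landscape wholesale.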
Ignoring fine tuning. Mainly, I expect fine-tuning to shift mask probabilities and only bias next-token prediction slightly and not particularly create an underlying goal.
In the end I thus mainly expect that, to the extent that fine-tuning gets the system to enact an agent, that agent was already in the pre-existing distribution of agents that the model could have simulated, and not really a new agent created by fine-tuning.
This is the crux of the whole post. If you have an underlying mechanistic theory (even if vague, or even just some mechanistic observations that don’t quite add up to a coherent theory but nevertheless inform intuitions) that supports the expectation that “fine-tuning shifts mask probabilities, only biases next-token prediction slightly, and does not particularly create an underlying goal”, that explanation would be the most interesting part of this post to read.
Besides, tuning LLMs directly during pre-training (which seems to be the way major labs are heading) may significantly change the standard picture of “self-supervised pre-training and only then fine-tuning” and have rather different mechanistic dynamics, so explanations developed for RLHF might not transfer to this new training approach.
My intuition for why fine-tuning doesn’t create new types of simulacra runs along these lines:
Inside, you probably have a bunch of structures that do various things. And these probably work together to make larger structures, and so on.
And then you apply fine-tuning, which is trying to shift high-level aspects of the output.
And I figure it’s probably a smaller step, in terms of the underlying weights, to shift from activating one structure more to activating another more than it is to create a whole new structure.
But if you’re just shifting from one structure to another, without creating new structures, it seems to me that this is just shifting to another sort of output the model could have created all along.
So, if fine-tuning can achieve its objectives by shifting the simulacra probabilities, it seems to me it would tend to do that first, before creating any new types of simulacra.
But I realize it’s more complicated than that, since a shift in the activation of structures at one level is the creation of a new structure at a higher level.
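A minimal sketch of that intuition (entirely hypothetical: the “structures” are random matrices and the mixing weights stand in for whatever fine-tuning actually adjusts), showing that if fine-tuning only re-weights fixed structures, every reachable output was already in the model’s repertoire:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three frozen "structures": each maps a context vector to an output.
structures = [rng.standard_normal((4, 8)) for _ in range(3)]

def model(context, mix_logits):
    # Output is a softmax-weighted combination of the fixed structures.
    mix = np.exp(mix_logits) / np.exp(mix_logits).sum()
    return sum(w * (S @ context) for w, S in zip(mix, structures))

context = rng.standard_normal(8)

# "Fine-tuning" here moves only the mixing weights: no new structure is
# created, so every reachable output lies in the span of outputs the
# pre-trained mixture could already produce.
before = model(context, mix_logits=np.array([0.0, 0.0, 0.0]))
after = model(context, mix_logits=np.array([3.0, 0.0, -3.0]))
print(before)
print(after)
```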
This explanation implies that circuits, or “structures”, are pretty much all there is in the mechanistic picture of Transformers, and that concepts, as inputs to downstream circuits, are “activated” probabilistically, between 0 and 1.
But this picture seems impoverished: I don’t think Transformers can be reduced to a network of circuits. There are perhaps also non-trivial concept representations within the activations, on which MLP layers perform non-trivial algorithmic computations. These representations, and the computations on them, are probably easier to shift during RLHF than the circuit topology.
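To gesture at that distinction, a toy sketch (the fixed sparsity mask standing in for “circuit topology” and the random “concept” direction are both invented for illustration): nudging activations along a concept direction changes the downstream computation without rewiring anything.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed sparse connectivity pattern standing in for "circuit topology".
mask = rng.random((6, 6)) < 0.3
W = rng.standard_normal((6, 6)) * mask

def mlp_layer(x):
    # An MLP-ish layer computing on the fixed circuit.
    return np.maximum(W @ x, 0.0)

# A "concept" as a direction in activation space; shifting activations
# along it changes downstream computation with the wiring untouched.
concept = rng.standard_normal(6)
concept /= np.linalg.norm(concept)

x = rng.standard_normal(6)
print(mlp_layer(x))
print(mlp_layer(x + 2.0 * concept))  # same topology, different behaviour
```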
And there are probably yet more phenomena and dynamics in Transformers that can’t be reduced to the things mentioned above, but I’m out of my depth to discuss this.
Also, if feedback is applied throughout pre-training it must influence the very structures formed within the Transformer.
Perhaps I used the wrong term; I did not mean by “activating” just on/off (with “more” being taken to imply probability?). I mainly meant more weight, though on/off could also be involved. Sorry, I am not necessarily familiar with the technical terms used.
I am also thinking of “structures” as a more general concept than just circuits, and not necessarily isolated within the system. I am thinking of a “structure” as a pattern within the system which achieves one or more functions. By a “higher-level” structure I mean a structure made of other structures.
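As a throwaway illustration of “a structure made of other structures” (the functions here are invented, just to show the composition):

```python
# Two low-level "structures" (patterns achieving a function)...
def detects_quotes(text: str) -> bool:
    return '"' in text

def detects_question(text: str) -> bool:
    return "?" in text

# ...composed into a higher-level structure with a function of its own.
def detects_quoted_question(text: str) -> bool:
    return detects_quotes(text) and detects_question(text)

print(detects_quoted_question('He said "what now?"'))  # True
```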
Also, if feedback is applied throughout pre-training it must influence the very structures formed within the Transformer.
Yes, in the post I was only considering the case where fine-tuning is applied after. Feedback being applied during pre-training is a different matter.