Finetuning LLMs with RL seems to make them more agentic. We will look at the changes RL makes to LLMs’ weights: we can see how localized the changes are, learn what sorts of computations make something agentic, and make conjectures about selected systems, giving us a better understanding of agency.
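To make the first step concrete, here is a minimal sketch of the kind of weight-diff analysis we have in mind, assuming a HuggingFace-style causal LM; the checkpoint names are placeholders, not real models:

```python
# Minimal sketch: compare a base LLM's weights to an RL-finetuned checkpoint
# and see how localized the changes are. Checkpoint names are placeholders.
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")            # placeholder
tuned = AutoModelForCausalLM.from_pretrained("rl-finetuned-model")   # placeholder

rel_change = {}
for (name, p_base), (_, p_tuned) in zip(
    base.named_parameters(), tuned.named_parameters()
):
    # Relative change per parameter tensor: ||delta|| / ||original||
    delta = (p_tuned.detach() - p_base.detach()).norm()
    rel_change[name] = (delta / (p_base.detach().norm() + 1e-12)).item()

# Largest relative changes first: if they concentrate in a few heads or MLPs,
# the RL update is localized; if they are spread evenly across layers, it isn't.
for name, rel in sorted(rel_change.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{rel:.4f}  {name}")
```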
Nobody has convinced us this is a bad use of our time, though we’d like to see people try.
I’ll give it a go.
“Agentiness” sounds to me like a probably pretty complex macro-level property of neural networks. As in, the definition of the property seems to itself depend on other macro-level properties and structures in networks that we don’t yet have decent operationalisations for either (e.g. “goals”, “search processes”).
I feel like we’re still at the very beginning of theory in defining and identifying even very mathematically simple macro-level structures in neural networks. We can barely even quantify how much parts of a network interact with other parts of it.
So my guess is that this is too hard to attack directly right now, unless you already have some clever guesses for what “agentiness” in networks looks like, or reason to suspect that “agentiness” is actually a mathematically far simpler property than one might naively think.
Otherwise, I fear your investigation will get lost in trying to identify which of the various changes to the LLM’s parameters correspond to a change in “agentiness”, rather than to a change in “capabilities”, a change in “goals”, a change in Moloch knows what, or just random perturbations.
You could maybe try to control for that by doing lots of other experiments too, like looking at what happens to the parameters of an LLM already trained to be agenty if you train it again to achieve some other goal that doesn’t require learning any new skills, to separate out goal changes. Or what happens to LLMs if they are finetuned to higher performance through methods that don’t involve RL, to separate out capability changes. Or what happens to normal RL agents in the course of normal RL training.
If you combined the data from all of these and found good operationalisations for all the effects and concepts involved, maybe you could separate “agentiness” out from all the other stuff. But at that point, your project would be more like “soloing the Selection Theorems agenda”.
(Which would be very cool if you actually pulled it off, of course)
Further, when it comes to understanding things about properties of neural networks, I don’t feel like we’ve exhausted the low-hanging fruit from looking at very simple models yet. Those are also generally a lot easier and quicker to work with. So I think any time you consider looking at big fancy models to learn something, you should ask yourself if there isn’t equally good progress to be made on your agenda by looking at small, dumb models instead.
The first part of your criticism makes me more excited, not less. We have considered doing the variations you suggested, and more, to distinguish which parts of the changes lead to which aspects of behavior.
I also think we can get info without robust operationalizations of concepts involved, but robust operationalizations would certainly allow us to get more info.
I am not one to shy away from hard problems just because they’re hard, especially when it seems like increasing hardness comes with increasing bits gleaned.
I think unless you’re extremely lucky and this turns out to somehow be a highly human-visible thing, you’d never notice what you’re looking for among all the other complicated changes happening, which nobody has analysis tools or even vague definitions for yet.
Which easier methods do you have in mind?
Dunno. I was just stating a general project-picking heuristic I have, and noting that it’s eyeing your proposal with some skepticism. Maybe search the literature for simpler problems and models with which you might probe the difference between RL and non-RL training. Something even a shallow MLP can handle, ideally.
Good ideas! I worry that a shallow MLP wouldn’t be capable enough for us to see a rich signal in the direction of increasing agency, but we should certainly try the easy version first.
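Something like this is what I’d imagine as the smallest version (the task, sizes, and training details below are made up purely for illustration): the same shallow MLP trained once with supervised learning and once with REINFORCE on a toy contextual-bandit task, then per-layer weight diffs from the shared initialization.

```python
# Toy sketch of the "easy version": one shallow MLP initialization, trained with
# supervised learning vs. REINFORCE, then per-layer weight change comparison.
# The contextual-bandit task and all sizes here are invented for illustration.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
n_ctx, n_act, hidden = 8, 8, 32
init = nn.Sequential(nn.Linear(n_ctx, hidden), nn.ReLU(), nn.Linear(hidden, n_act))
sup, rl = copy.deepcopy(init), copy.deepcopy(init)

def batch(n=64):
    ctx = torch.randint(0, n_ctx, (n,))
    return nn.functional.one_hot(ctx, n_ctx).float(), ctx  # correct action = context id

opt_sup = torch.optim.Adam(sup.parameters(), lr=1e-2)
opt_rl = torch.optim.Adam(rl.parameters(), lr=1e-2)
for _ in range(500):
    x, y = batch()
    # Supervised: cross-entropy on the correct action.
    opt_sup.zero_grad()
    nn.functional.cross_entropy(sup(x), y).backward()
    opt_sup.step()
    # RL: sample an action, get reward 1 if correct, take a REINFORCE step.
    opt_rl.zero_grad()
    dist = torch.distributions.Categorical(logits=rl(x))
    a = dist.sample()
    reward = (a == y).float()
    (-(dist.log_prob(a) * (reward - reward.mean())).mean()).backward()
    opt_rl.step()

# Compare how far each layer moved from the shared initialization.
for (name, p0), p_sup, p_rl in zip(
    init.named_parameters(), sup.parameters(), rl.parameters()
):
    d_sup = ((p_sup - p0).norm() / p0.norm()).item()
    d_rl = ((p_rl - p0).norm() / p0.norm()).item()
    print(f"{name}: supervised change {d_sup:.3f}, RL change {d_rl:.3f}")
```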
I don’t think I’m seeing the complexity you’re seeing here. For instance, one method we plan on trying is taking sets of heads and MLPs and reverting them to their original values to see that set’s qualitative influence on behavior. I don’t think this requires rigorous operationalizations.
An example: in a chess-playing context, this will lead to different moves, or to out-of-action-space behavior. The various kinds of out-of-action-space behavior, or biases in how the moves change, seem like they’d give us insight into what the head set was doing, even if we don’t understand the mechanisms inside it.
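To make that concrete, here is a rough sketch of the reversion experiment, assuming a HuggingFace-style model; the checkpoint names and layer prefixes are placeholders, and reverting whole attention or MLP blocks is the coarse version of what I mean (per-head reversion would mean slicing the q/k/v/o projection matrices instead).

```python
# Rough sketch: revert a chosen set of modules in the RL-finetuned model back to
# their base-model weights, then look at how behavior changes qualitatively.
# Checkpoint names and module prefixes are placeholders, not real identifiers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-model")            # placeholder
tuned = AutoModelForCausalLM.from_pretrained("rl-finetuned-model")   # placeholder
tok = AutoTokenizer.from_pretrained("rl-finetuned-model")            # placeholder

# Which modules to revert; the prefixes depend on the architecture.
revert_prefixes = ["model.layers.10.mlp", "model.layers.12.self_attn"]

base_state = base.state_dict()
patched = tuned.state_dict()
for name in patched:
    if any(name.startswith(p) for p in revert_prefixes):
        patched[name] = base_state[name].clone()
tuned.load_state_dict(patched)

# Qualitative check: does the reverted model still produce legal, sensible moves?
prompt = "1. e4 e5 2. Nf3"  # chess-notation prompt, as in the example above
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = tuned.generate(**inputs, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```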
That sounds to me like it would give you a very rough, microscope-level view of all the individual things the training is changing around. I am sceptical that by looking at this ground-level data, you’d be able to separate out the things-that-are-agency from everything else that’s happening.
As an analogy, looking at what happens if you change the wave functions of particular clumps of silica atoms doesn’t help you much in divining how the IBM 608 divides numbers, if you haven’t even worked out yet that the atoms in the machine are clustered into things like transistors and cables, and actually, you don’t even really know how dividing numbers works even on a piece of paper, you just think of division as “the inverse of multiplication”.