The arguments in the paper are representative of Yoshua’s views rather than mine, so I won’t directly argue for them, but I’ll give my own version of the case against the claim that the distinctions drawn here between RL and the science AI all break down at high capability levels.
It seems like common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizons those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you’re training and the outcomes being achieved.
At the top of the spectrum, you have systems trained on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories.
Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories that are less tightly coupled to the outside world.
And at the bottom of the spectrum you have systems trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated very far across time at all.
At the top of the spectrum, if you train a competent system it seems almost guaranteed that it’s a powerful agent: it is a machine for pushing the world into certain configurations. But at the bottom of the spectrum that seems much less likely, because its input-output behaviour wasn’t selected to be effective at causing certain outcomes.
Yes, there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don’t think they’re representative. And yes, depending on what kind of system it is, you might be able to turn it into an agent with a bit of scaffolding, but if you have the choice not to, that’s an importantly different situation compared to the top of the spectrum.
And yes, it seems possible such setups lead to an agentic shoggoth completely by accident; we don’t understand enough to rule that out. But I don’t see how you end up judging the probability that we get a highly agentic system to be more or less the same wherever we are on the spectrum (if you do). Or perhaps it’s just that you think the distinction is not being handled carefully in the paper?
Ah, I should emphasise: I do think all of these things could help. It definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I think are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down at superhuman capabilities. It’s hard to tell which of these are actual disagreements and which are just the paper trying to be concise and approachable, so I’ll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
It’s easy to accidentally scaffold some kind of agent out of an oracle as soon as there’s any kind of consistent causal process from the oracle’s outputs to the world, even absent feedback loops. In other words, I agree you can choose to create agents, but I’m not totally sure you can easily choose not to (see the first sketch after this list).
Any system trained to predict the actions of agents over long periods of time will develop an understanding of how agents could act to achieve their goals; in a sense this is the premise of offline RL and of things like decision transformers (see the second sketch after this list).
It might be pretty easy for agent-like knowledge to ‘jump the gap’, e.g. a model trained to predict deceptive agents might be able to analogise to itself being deceptive
Sufficient capability at broad prediction is enough to converge on at least the knowledge of how to circumvent most of the guardrails you describe, e.g. how to collude
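To make the first point concrete, here is a minimal sketch of the kind of loop I have in mind. The names and functions (query_oracle, execute) are hypothetical stand-ins, not any real model or API; the point is only that once an oracle’s outputs feed into any consistent causal process, the surrounding loop starts pushing the world toward a goal, even though the oracle itself was never trained on outcomes.

```python
# Hypothetical sketch: wiring a pure predictor/oracle into a loop that acts on
# the world. `query_oracle` and `execute` are stand-ins, not real APIs.

def query_oracle(prompt: str) -> str:
    """Stand-in for a model trained with a purely output-based objective
    (e.g. next-token prediction). It only answers; it never acts."""
    return "suggested next step for: " + prompt

def execute(suggestion: str) -> str:
    """Stand-in for any consistent causal path from text to the world:
    a human following instructions, a script, an order-execution system."""
    return "observed result of doing: " + suggestion

def run(goal: str, steps: int = 5) -> None:
    """The scaffolding loop. Nothing feeds back into the oracle's weights,
    but in the real version of this loop, the cycle of
    suggest -> execute -> observe -> suggest is what would steer the world
    toward `goal`."""
    observation = "initial state"
    for _ in range(steps):
        prompt = f"Goal: {goal}. Last observation: {observation}. What next?"
        suggestion = query_oracle(prompt)
        observation = execute(suggestion)
        print(observation)

run("increase the company's stock price")
```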
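And to illustrate the second point, here is a toy, invented example of return-conditioned behaviour cloning, which is the decision-transformer premise with the transformer replaced by a lookup table. The training signal is pure supervised prediction of logged actions (bottom of the spectrum), yet conditioning on a good outcome at test time recovers goal-directed behaviour. The environment and numbers are made up purely for illustration.

```python
import random
from collections import Counter, defaultdict

# Toy chain environment: states 0..6, actions -1/+1, success means reaching
# state 6 within 20 steps (return 1), otherwise return 0. All numbers invented.
N, MAX_STEPS = 6, 20

def rollout(policy):
    """Run one episode; return the (state, action) pairs and the episode return."""
    s, traj = 0, []
    for _ in range(MAX_STEPS):
        a = policy(s)
        traj.append((s, a))
        s = max(0, min(N, s + a))
        if s == N:
            return traj, 1
    return traj, 0

# 1) Collect offline data from a non-agentic behaviour policy (a random walk).
data = [rollout(lambda s: random.choice([-1, 1])) for _ in range(2000)]

# 2) Purely supervised objective: predict the logged action from (state, return).
#    The 'model' here is just a majority vote per (state, return) pair.
counts = defaultdict(Counter)
for traj, ret in data:
    for s, a in traj:
        counts[(s, ret)][a] += 1

def conditioned_policy(s, target_return=1):
    votes = counts.get((s, target_return))
    return votes.most_common(1)[0][0] if votes else random.choice([-1, 1])

# 3) Condition on the good outcome at test time: the pure predictor now behaves
#    like a goal-seeking agent, despite never having been trained with RL.
wins = sum(rollout(conditioned_policy)[1] for _ in range(200))
print(f"success rate when conditioned on return=1: {wins / 200:.2f}")
```

On a typical run the outcome-conditioned predictor reaches the goal far more reliably than the random policy that generated its training data, which is the sense in which "just predicting agents" already contains knowledge of how to achieve outcomes.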
It is good to notice the spectrum above. It seems likely that, for a fixed amount of compute/effort, one extreme of this spectrum produces much less agency than the other. Call that the direct effect.
Are there other effects? For instance, do you get the same ability to “cure cancer” for a fixed amount of compute/effort across the spectrum? Agency seems useful, so the ability you get per unit of compute is probably correlated with agency across this spectrum.
If we are in a setting where an outside force demands you reach a given ability level, then this indirect effect matters, because staying at the less agentic end of the spectrum means you will have to use a larger amount of compute.
[optional] To illustrate this problem, consider something that I don’t think people would consider safer: instead of using gradient descent, just sampling the weights of the neural net at random until you get a low loss. (I am not trying to make an analogy here.)
It would be great if someone had a way to compute the “net” effect on agency across the spectrum, also taking into account the indirect path of more compute needed → more compute → more agency. I suspect it might depend on which ability level you need to reach, and we may or may not be able to figure it out without experiments.
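For what it’s worth, here is a toy sketch of what “computing the net effect” could look like. Everything in it is a made-up assumption (the functional forms for ability and agency, the constants, the target ability level); the only point is the structure of the calculation: fix the demanded ability level, invert the ability curve to find the compute required at each point on the spectrum, then evaluate agency at that compute.

```python
import math

# Toy model with made-up functional forms, only to illustrate how the direct
# effect (position on the spectrum) and the indirect effect (extra compute
# needed to hit a demanded ability level) could be combined.
# s in [0, 1]: 0 = purely output-based supervision, 1 = long-horizon outcome-based RL.
# c: compute spent.

def ability(s, c):
    # Assumption: outcome-based training is more compute-efficient per unit ability.
    return (1 + s) * math.log(1 + c)

def agency(s, c):
    # Assumption: agency grows with both spectrum position and compute.
    return (s + 0.1) * math.log(1 + c)

def compute_needed(s, target_ability):
    # Invert ability(s, c) = target_ability for c.
    return math.exp(target_ability / (1 + s)) - 1

TARGET = 10.0  # the ability level an outside force demands

for s in [0.0, 0.25, 0.5, 0.75, 1.0]:
    c = compute_needed(s, TARGET)
    print(f"s={s:.2f}  compute needed={c:10.1f}  agency at that compute={agency(s, c):.2f}")
```

With these particular made-up numbers the direct effect dominates: the output-based end still ends up less agentic even after spending far more compute. But making agency depend more heavily on compute and less on spectrum position flips the conclusion, which is exactly the sense in which the answer might depend on which ability level you need to reach and might need experiments to settle.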