This looked exciting when you mentioned it, and it doesn’t disappoint.
To check that I get it, here is my own summary:
Because ML looks like the most promising approach to AGI at the moment, we should adapt and/or instantiate the classical arguments for AI risk in an ML context. The main differences are the separation between a training phase and a deployment phase, and the form taken by the objective function (a mix of human and automated feedback learned from data, instead of a hardcoded function). (I give a code sketch of this two-phase picture after the four points below.)
(Orthogonality thesis) Even if any combination of goal and intelligence can exist in a mind, the minds created through an ML-like training procedure might be ones with specific relations between goals and intelligence. In that context, orthogonality is fundamentally about the training process, and about whether it contains two separable sub-processes, one for competence and one for goals.
(Instrumental Convergence) It matters whether traditional instrumental subgoals like self-preservation emerge during training or during deployment. In the training case, it’s more a problem of inner alignment (understood in the broad sense), because the subgoals will become final goals for the system; in the deployment case, we fall back on the classic argument about convergent instrumental subgoals.
(Fragility of Value) Here too, it matters whether the classic problem appears during training or during deployment: if the error in the goal occurs during training, then the argument is about the consequences of outer misalignment; if it occurs during deployment, then the argument is about the consequences of approximate alignment.
(Goodhart) Same split as in the last two points. When the measure/proxy is used during training, the argument is that the resulting system will be optimized for the measure, possibly deciding wrongly in extreme situations; when the measure is used during deployment, it’s the resulting AI itself that intentionally optimizes the measure, leading to a potentially stronger and more explicit split between the target and the measure.
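To make this concrete, here is a minimal sketch of the two-phase picture and of where each failure mode enters. Every name in it is hypothetical and purely illustrative; it is not code from the post.

```python
import random

# Minimal sketch of the training/deployment split; all names hypothetical.

def human_evaluation(trajectory):
    # Stub for costly human feedback on a trajectory.
    return sum(trajectory) / max(len(trajectory), 1)

def automated_evaluation(trajectory):
    # Stub for a cheap automated proxy (e.g. a learned reward model).
    return max(trajectory, default=0.0)

def objective_function(trajectory):
    # Training-time feedback: a mix of human and automated evaluations.
    # Outer misalignment and training-time Goodhart enter here: this mix
    # is only a measure of what we actually want.
    return 0.5 * human_evaluation(trajectory) + 0.5 * automated_evaluation(trajectory)

class Agent:
    def __init__(self):
        self.parameter = 0.0  # Stand-in for learned weights/goals.

    def act(self, steps=5):
        # Behaviour depends only on whatever the parameters encode.
        return [self.parameter + random.random() for _ in range(steps)]

    def update(self, feedback):
        # Inner misalignment enters here: what the agent internalises
        # from the feedback need not match the objective function.
        self.parameter += 0.01 * feedback

def train(agent, num_steps=100):
    # Training phase: the objective function is used.
    for _ in range(num_steps):
        agent.update(objective_function(agent.act()))
    return agent

def deploy(agent):
    # Deployment phase: the objective function is no longer called; the
    # agent acts on its internalised goals, which is where deployment-time
    # Goodhart and convergent instrumental subgoals would show up.
    return agent.act()

if __name__ == "__main__":
    print(deploy(train(Agent())))
```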
I agree that there’s a lot of value in this specialization of the risk arguments to ML. In particular, I hadn’t thought about convergent final goals (at least until you mentioned them to me in conversation), and the distinction drawn for the fragility of value seems highly relevant.
I do have a couple of remarks about the post.
So my current default picture of how we will specify goals for AGIs is:
At training time, we identify a method for calculating the feedback to give to the agent, which will consist of a mix of human evaluations and automated evaluations. I’ll call this the objective function. I expect that we will use an objective function which rewards the agent for following commands given to it by humans in natural language.
At deployment time, we give the trained agent commands in natural language. The objective function is no longer used; hopefully the agent instead has internalised a motivation/goal to act in ways which humans would approve of, which leads it to follow our commands sensibly and safely.
This breakdown makes the inner alignment problem a very natural concept—it’s simply the case where the agent’s learned motivations don’t correspond to the objective function used during training.[1] It also makes ambitious approaches to alignment (in which we try to train an AI to be motivated directly by human values) less appealing: it seems strictly easier to train an agent to obey natural language commands in a common-sense way, in which case we get the benefit of continued flexible control during deployment.[2]
This looks like off-line training to me. That’s not a problem per se, but it also means that you have an implicit hypothesis that the AGI will be model-based; otherwise, it would have trouble adapting its behavior after getting new information.
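To spell out that hypothesis, here is a toy sketch (hypothetical names, not a real API) of the difference I have in mind: a model-free policy frozen after off-line training cannot use new information, while a model-based agent can update its world model and re-plan at deployment.

```python
# Toy contrast between a frozen model-free policy and a model-based agent;
# all names are hypothetical.

class ModelFreePolicy:
    """Maps observations directly to actions; frozen after off-line training,
    so new information cannot change the mapping."""

    def __init__(self, table):
        self.table = table  # Learned during training, then fixed.

    def act(self, observation):
        return self.table.get(observation, "default_action")

class ModelBasedAgent:
    """Keeps an explicit world model and plans against it, so information
    gathered at deployment can still change behaviour."""

    def __init__(self, world_model):
        self.world_model = world_model  # Maps (state, action) to a predicted outcome.

    def observe(self, state, action, outcome):
        # Deployment-time adaptation: revise the model from new data.
        self.world_model[(state, action)] = outcome

    def act(self, state, actions, utility):
        # Plan: pick the action whose predicted outcome scores best.
        return max(actions, key=lambda a: utility(self.world_model.get((state, a), 0)))

# Usage: the model-based agent exploits information it never saw in training.
agent = ModelBasedAgent({})
agent.observe("room", "open_door", 1)  # new information at deployment
assert agent.act("room", ["open_door", "wait"], utility=lambda o: o) == "open_door"
```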
Consider Bostrom’s orthogonality thesis, which states:
Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
As stated, this is a fairly weak claim: it only talks about which minds are logically possible, rather than minds which we are likely to build.
The original version of this thesis is roughly as follows:
Instrumental convergence thesis: a wide range of the final goals which an AGI could have will incentivise it to pursue certain convergent instrumental subgoals (such as self-preservation and acquiring resources).
However, this again only talks about the final goals which are possible, rather than the ones which are likely to arise in systems we build.
This criticism, that the theses describe possible minds rather than the minds we are likely to build, seems far more potent against the first thesis than against the second.
The orthogonality thesis indeed says that ∀(goal, competence), ∃M, a mind with this goal and this competence. This doesn’t tell us whether the training procedures we use are limited to a specific part of the space of (goal, competence) pairs.
On the other hand, the instrumental convergence thesis basically says that for almost all goals, the AGI will have the convergent instrumental subgoals. If this is true, then it definitely applies to minds trained through ML, as long as their goals fall into the broad category covered by the thesis. So this thesis is far more potent for trained minds.
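Writing out the logical forms (my notation) makes the asymmetry explicit. Orthogonality is an ∀∃ claim: ∀(goal, competence), ∃M such that goal(M) = goal and competence(M) = competence. It only asserts existence somewhere in the space of possible minds, which our training procedures may never reach. Instrumental convergence is an ∀∀ claim: for (almost) every goal in a wide range, ∀M with goal(M) = goal, M is incentivised to pursue the convergent subgoals. It therefore covers every mind with such a goal, including the ones we actually train.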
Thanks for the feedback! Some responses:

This looks like off-line training to me. That’s not a problem per se, but it also means that you have an implicit hypothesis that the AGI will be model-based; otherwise, it would have trouble adapting its behavior after getting new information.
I don’t really know what “model-based” means in the context of AGI. Any sufficiently intelligent system will model the world somehow, even if it’s not trained in a way that distinguishes between a “model” and a “policy”. (E.g. humans weren’t.)
On the other hand, the instrumental convergence thesis basically says that for almost all goals, the AGI will have the convergent instrumental subgoals. If this is true, then it definitely applies to minds trained through ML, as long as their goals fall into the broad category covered by the thesis. So this thesis is far more potent for trained minds.
I’ll steal Ben Garfinkel’s response to this. Suppose I said that “almost all possible ways you might put together a car don’t have a steering wheel”. Even if this is true, it tells us very little about what the cars we actually build might look like, because the process of building things picks out a small subset of all possibilities. (Also, note that the instrumental convergence thesis doesn’t say “almost all goals”, just a “wide range” of them. Edit: oops, this was wrong; although the statement of the thesis given by Bostrom doesn’t say that, he says “almost all” in the previous paragraph.)