If we are to understand you as arguing for something trivial, then I think it only has trivial consequences. We must add nontrivial assumptions if we want to offer a substantive argument for risk.
Suppose we have a collection of systems of different ability that can all, under some conditions, solve X. Let’s say an “X-wrench” is an event that defeats systems of lower ability but not systems of higher ability (i.e. prevents them from solving X).
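To make the definition slightly more concrete, here is one reading it seems to presuppose (the scalar ability $a(S)$ and threshold $t(w)$ are my notation, not anything in the original setup): each wrench $w$ comes with a difficulty, and

$$ w \text{ defeats } S \iff a(S) < t(w). $$

This totally-ordered picture of ability is one concrete version of the vague assumption about general capability I mention below.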
A system that achieves X with probability 1−ϵ must defeat all X-wrenches except a set of total probability at most ϵ. If the set of events that are Y-wrenches but not X-wrenches has probability δ, then the system can defeat all Y-wrenches except a set of total probability at most ϵ+δ.
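Spelled out, this is just a union bound, under two implicit assumptions: failing a task means an undefeated wrench for that task occurred, and whether a system defeats a given wrench does not depend on which task it is pursuing. Writing $W_X$, $W_Y$ for the sets of X-wrenches and Y-wrenches and $W_X^S$, $W_Y^S$ for the wrenches of each kind that system $S$ cannot defeat (again, my notation):

$$ \Pr[S \text{ fails } Y] \;\le\; \Pr[W_Y^S] \;\le\; \Pr\big[W_X^S \cup (W_Y \setminus W_X)\big] \;\le\; \Pr[W_X^S] + \Pr[W_Y \setminus W_X] \;\le\; \epsilon + \delta. $$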
That is, if the challenges involved in achieving X are almost the same as the challenges involved in achieving Y, then something good at achieving X is almost as good at achieving Y (granting the somewhat vague assumptions about general capability baked into the definition of wrenches).
However, if X is something that people basically approve of and Y is something people do not approve of, then I do not think the challenges almost overlap. In particular, to do Y you will, with high probability, need to defeat determined opposition, which is not likely to be necessary if you want X. That is: no need to kill everyone with nanotech if you're doing what you were supposed to.
In order to sustain the argument for risk, we need to assume that the easiest way to defeat X-wrenches is to learn a much more general wrench-defeating ability than necessary and apply it to solving X, and, furthermore, that this ability is sufficient to also defeat Y-wrenches. This is plausible (we do actually find it helpful to build generally capable systems to solve very difficult problems) but also plausibly false. Even a highly capable AI that achieves long-term objectives could end up substantially specialised for those objectives.
As an aside, if the set of Y-wrenches includes the gradient updates received during training, then an argument that an X-solver generalises to a Y-solver may also imply that deceptive alignment is likely (alternatively, proving that X-solvers generalise to Y-solvers is at least as hard as proving deceptive alignment).