That (on its own, without further postulates) is a fully general argument against improving intelligence.
Well, it’s primarily a statement about capabilities. The intended construal is that if a given system’s capabilities profile permits it to accomplish some sufficiently transformative task, then that system’s capabilities are not limited to only benign such tasks. I think this claim applies to most intelligences that can arise in a physical universe like our own (though necessarily not in all logically possible universes, given NFL theorems): that there exists no natural subclass of transformative tasks that includes only benign such tasks.
(Where, again, the rub lies in operationalizing “transformative” such that the claim follows.)
We have to accept some level of danger inherent in existence; the question is what makes AI particularly dangerous. If this special factor isn’t present in GPT+DPO, then GPT+DPO is not an AI notkilleveryoneism issue.
I’m not sure how likely GPT+DPO (or GPT+RLHF, or in general GPT-plus-some-kind-of-RL) is to be dangerous in the limits of scaling. My understanding of the argument against is that the base (large language) model derives most (if not all) of its capabilities from imitation, and the amount of RL needed to elicit desirable behavior from that base set of capabilities isn’t enough to introduce substantial additional strategic/goal-directed cognition compared to the base imitative paradigm. In other words, the amount and kinds of training we’ll be doing in practice are more likely to bias the model towards behaviors that were already a part of the base model’s (primarily imitative) predictive distribution than they are to elicit strategic thinking de novo.
That strikes me as substantially an empirical proposition, which I’m not convinced the evidence from current models says a whole lot about. But the disjunct I mentioned isn’t an argument for or against the proposition; you can instead see it as a larger claim that parametrizes the class of systems for which the smaller claim might or might not be true, with respect to certain capabilities thresholds associated with specific kinds of tasks. And what the larger claim says is that, to the extent that GPT+DPO (and associated paradigms) fail to produce reasoners which could (in terms of capability, saying nothing about alignment or “motive”) be dangerous, they will also fail to be “transformative”. That in turn is an issue in precisely those worlds where systems with “transformative” capabilities are economically incentivized over systems without those capabilities (which is itself another empirical question!).
To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.
Since I believe this, that makes it hard for me to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what is instructed. I mostly don’t expect to be able to get to superintelligence without either (1) the “RL” portion of the GPT+RL paradigm playing a much stronger role than it does for current systems, or (2) using some other training paradigm entirely. And the argument for obedience/corrigibility becomes weaker/nonexistent respectively in each of those cases.
Possibly we’re in agreement here? You say you expect GPT+DPO to stagnate and be replaced by something else; I agree with that. I merely happen to think the reason it will stagnate is that its safety properties don’t come free; they’re bought and paid for by a price in capabilities.