What I’m saying is that if GPT+DPO creates imitation-based intelligences that can be dangerous because they were intentionally instructed to do something bad (“hey, please kill that guy” and then it kills him), then that’s not particularly concerning from an AI alignment perspective, because it has a similar danger profile to giving a human the same instruction. You would still want policy to govern it, similar to how we have policy to govern human-on-human violence, but it’s not the kind of x-risk that notkilleveryoneism is about.
So basically you can have “GPT+DPO is superintelligent, capable and dangerous” without having “GPT+DPO is an x-risk”. That said, I expect GPT+DPO to stagnate and be replaced by something else, and that something else could be an x-risk (and conditional on the negation of natural impact regularization, I strongly expect it would be).
To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.
Since I believe this, I find it hard to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what it is instructed to do. I mostly don’t expect to be able to get to superintelligence without either (1) the “RL” portion of the GPT+RL paradigm playing a much stronger role than it does for current systems, or (2) using some other training paradigm entirely. And the argument for obedience/corrigibility becomes weaker in case (1) and nonexistent in case (2).
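To make that concrete, here is the standard DPO objective (Rafailov et al., 2023), in its usual textbook notation; I’m quoting the generic form, not anything specific to a deployed system:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Here $x$ is a prompt, $y_w$ and $y_l$ are the preferred and dispreferred completions, $\sigma$ is the logistic function, $\pi_{\text{ref}}$ is the imitation-trained (SFT) model, and $\beta$ sets the strength of the implicit KL anchor to $\pi_{\text{ref}}$. On my reading, that anchor is where the safety story and the capability ceiling come from at once: the trained policy is only ever rewarded for reweighting behavior the imitation model already assigns probability to, so making the “RL” portion stronger means loosening the very term the obedience argument relies on.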
Possibly we’re in agreement here? You say you expect GPT+DPO to stagnate and be replaced by something else; I agree with that. I merely happen to think the reason it will stagnate is that its safety properties don’t come free; they’re bought at a price in capabilities.
Are we using the word “transformative” in the same way? I imagine that if society got reorganized into e.g. AI minds that hire tons of people to continually learn novel tasks that they can then imitate, that would be considered transformative, because it would entirely change people’s role in society, like the agricultural revolution did. Whereas right now very few people have jobs that are explicitly about pushing the frontier of knowledge, in the future that might be ~the only job that exists (conditional on GPT+DPO being the future, which again is not a mainline scenario).