I’m surprised I didn’t see here my biggest objection:
MIRI talks about “pivotal acts”, building an AI that’s superhuman in some engineering disciplines (but not generally) and having it do a human-specified thing to halt the development of AGIs (e.g. seek and safely melt down sufficiently large compute clusters) in order to buy time for alignment work. Their main reason for this approach is that it seems less doomed to have an AI specialize in consequentialist reasoning about limited domains of physical engineering than to have it think directly about how its developers’ minds work.
If you are building an alignment researcher, you are building a powerful AI that is directly thinking about misalignment—exploring concepts like humans’ mental blind spots, deception, reward hacking, hiding thoughts from interpretability, sharp left turns, etc. It does not seem wise to build a consequentialist AI and explicitly train it to think about these things, even when the goal is for it to treat them as wrong. (Consider the Waluigi Effect: you may have at least latently constructed the maximally malicious agent!)
I agree, of course, that the biggest news here is the costly commitment—my prior model of them was that their alignment team wasn’t actually respected or empowered, and the current investment is very much not what I would expect them to do if that were the case going forward.
Given the current paradigm and technology, it seems far safer to have an AI work on alignment research than on highly difficult engineering tasks like nanotech. In particular, note that we only need an AI to totally obsolete prior efforts for this to be as good a position as we could reasonably hope for.
In the current paradigm, it seems like the AI capability profile for R&D looks reasonably similar to that of humans.
Then, my overall view is that (for the human R&D capability profile) totally obsoleting alignment progress to date will be much, much easier than developing the engineering-based hard power necessary for a pivotal act.
This is putting aside the extreme toxicity of directly trying to develop decisive strategic advantage level hard power.
For instance, it’s no coincidence that current humans work on advancing alignment research rather than trying to develop hard power themselves...
So, you’ll be able to use considerably dumber systems to do alignment research (merely human level as opposed to vastly superhuman).
Then, my guess is that the safety gained from this reduction in intelligence will dominate whatever safety you’d gain from censoring the AI’s world model (i.e. keeping it from thinking about misalignment-related concepts).
The pivotal acts that are likely to work aren’t antisocial. My guess is that the reason nobody’s working on them is lack of buy-in (and lack of capacity).
Also, davidad’s Open Agency Architecture is a very concrete example of what a non-antisocial pivotal act that respects the preferences of various human representatives would look like (i.e. a pivotal process).
Perhaps not realistically feasible in its current form, yes, but davidad’s proposal suggests that there might exist such a process, and we just have to keep searching for it.
Yeah, if this wasn’t clear, I was referring to ‘pivotal acts’ which use hard engineering power sufficient for decisive strategic advantage. Things like ‘brain emulations’ or ‘build a fully human-interpretable AI design’ don’t seem particularly antisocial (but may be poor ideas for feasibility reasons).
Agree that the current AI paradigm can be used to make significant progress in alignment research if used correctly. I’m thinking of something like Cyborgism: leaving most of the “agency” to humans and leveraging prosaic models, highly specialized in scope, to boost researcher productivity, which wouldn’t involve dangerous consequentialist cognition in the trained systems.
However, the problem is that this isn’t what OpenAI is doing: if I understand correctly, they’re planning to build a full-on automated researcher that does alignment research end-to-end, which is what orthonormal was pointing out is dangerous, since its cognition would involve exactly the dangerous concepts above.
So, leaving aside the problems with other alternatives like pivotal acts for now, it doesn’t seem like your points are necessarily inconsistent with orthonormal’s view that OpenAI’s plans (at least in their current form) seem dangerous.
I think OpenAI is probably agnostic about how to use AIs to get more alignment research done.
That said, speeding up human researchers by large multipliers will eventually be required for the plan to be feasible. Like 10-100x rather than 1.5-4x. My guess is that you’ll probably need AIs running considerably autonomously for long stretches to achieve this.