This is good (see my other comments) but:
This didn’t address my biggest hesitation about the control agenda: preventing minor disasters from limited AIs could also prevent the alignment concern (or terror) those near-misses would otherwise generate, and so indirectly lead to a full takeover once we get an AGI smart enough to circumvent control measures.
That argument also applies to alignment research!
Also, control techniques might allow us to catch the AI trying to do bad stuff.
To be clear, I’m not against control work, just conflicted about it for this reason.
That’s a good point. That argument applies to prosaic “alignment” research, which seems importantly different from (but related to) “real” alignment efforts.
Prosaic alignment thus far is mostly about making the behavior of LLMs align roughly with human goals/intentions. That’s different in type from actually aligning the goals/values of goal-having entities with our own. BUT there’s enough overlap that they’re not entirely two different efforts.
I think many current prosaic alignment methods probably do extend to aligning foundation-model-based AGIs, but in a pretty complex way.
So yes, that argument does apply to most “alignment” work, but much of that work is probably also making progress on the actual problem. Control work is a stopgap measure that could either buy time to solve the actual problem, or mask the actual problem so it isn’t solved in time. I have no prediction as to which, because I haven’t built gears-level models detailed enough to apply here.
Edit: I’m almost finished with a large post trying to make progress on how prosaic alignment might or might not actually help align takeover-capable AGI agents if they’re based on current foundation models. I’ll try to link it here when it’s out.
WRT control catching them trying to do bad things: yes, good point! Either you or Ryan also had an unfortunately convincing post about how an org might well not un-deploy a model it caught trying to do very bad things… so that’s complex too.