CEO at Redwood Research.
AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it’s kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I’ll probably be willing to call to discuss briefly.
I agree with your main point and I agree that this point seems curiously underrated by AI safety people.
I don’t understand whether by “alignment” you mean:
Indefinitely scalable alignment: techniques that aim to be robust for AIs of any capability level
Alignment of any superintelligences: techniques that aim to work for the earliest superintelligent AIs. This might be much easier and is plausibly sufficient (as long as these AIs have enough time to develop techniques to align their successors).
Either way, I think it probably makes sense to use a more specific term than “alignment” for this problem: it’s so natural to talk about whether non-superintelligent systems are aligned, and the alignment of non-superintelligence is IMO important for AI risk.
As another note, “aligned with human values” is maybe pretty different from “follows human instructions”. I think ARC’s intended techniques are agnostic to whether their model is truly, in its heart, aligned; they just want to make models that follow their spec. So e.g. a model that is a paperclipper but will never act on its paperclipping urges would be fine.
I don’t know why you think debate work at GDM counts as alignment if you don’t think that various other random prosaic alignment stuff counts. Debate is clearly not indefinitely scalable, and in any case it is a technique for generating rewards, which doesn’t suffice for alignment unless you make dubious assumptions or use some other technique on top.