For what it’s worth, I consider problem 1 to be somewhat less of a showstopper than you do, because of things like AI control (which while unlikely to scale to arbitrary intelligence levels, is probably useful for the problem of instrumental goals).
However, I do think problems 2 and 3 are a big reason why I’m less of a fan of deploying ASI/AGI widely, as @joshc wants to do.
Something close to proliferation concerns (especially around bioweapons) is a big reason why I disagree with @Richard_Ngo on the AI safety community being cooperative with open-source demands, or having a cooperative strategy toward open source in the endgame.
Eventually, we will build AIs that could be used safely by small groups, but that cannot be released to the public, except through locked-down APIs with countermeasures against misuse, without everyone or almost everyone dying.
However, I think we can mitigate misuse concerns without requiring much jailbreak robustness, à la @ryan_greenblatt’s post on managing catastrophic misuse without robust AIs:
https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
I like your thoughts on problem 4, and yeah, memory complicates a lot of considerations around alignment in interesting ways.
I agree with you that instruction following should be used as a stepping stone to value alignment, and I even have a specific proposal in mind, which at the moment is the Infra-Bayes Physicalist Super-Imitation.
I agree with your post on this issue, so I’m just listing out more considerations.