I am definitely thinking of IF (instruction-following) as it applies to systems with the capability for unlimited autonomy. Intent alignment as a concept doesn’t end at some level of capability—although I think we often assume it would.
How it would understand “the right thing” is the question. But yes, intent alignment as I’m thinking of it does scale smoothly into value alignment plus corrigibility if you can get it right enough.