Thanks, that’s all relevant and useful!
Simplest first: I definitely envision a hierarchy of reporting and reviewing questionable requests. That seems like an obvious and cheap route to partially addressing the jailbreaking/misuse issues.
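To make that concrete, here's a minimal sketch of the kind of report-and-review hierarchy I have in mind. Everything in it (the risk tiers, the keyword-based classifier, the queue names) is made up for illustration; a real system would use a trained classifier and an actual human review process.

```python
# Hypothetical sketch of escalating questionable requests, not any lab's real moderation stack.
from enum import Enum

class RiskLevel(Enum):
    LOW = 0       # answer normally
    MEDIUM = 1    # answer, but log for later audit
    HIGH = 2      # refuse and escalate to human review

def assess_risk(request: str) -> RiskLevel:
    """Placeholder classifier; in practice this would be a trained model, not keywords."""
    flagged_terms = ("synthesize", "exploit", "bypass")
    hits = sum(term in request.lower() for term in flagged_terms)
    return RiskLevel(min(hits, 2))

def route_request(request: str, audit_log: list, review_queue: list) -> str:
    level = assess_risk(request)
    if level is RiskLevel.HIGH:
        review_queue.append(request)   # escalate up the hierarchy to human reviewers
        return "Refused pending human review."
    if level is RiskLevel.MEDIUM:
        audit_log.append(request)      # keep a record, still respond
    return "Proceeding with the request."
```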
I’ve also envisioned smarter LLM agents “thinking through” the possible harms of their actions, and you’re right that this requires at least a pretty good grasp of human values. Their grasp of human values is already pretty good and likely to get better, as you say. I haven’t thought of this as value alignment, though, because I’ve assumed that developers will weight following instructions above adhering to moral strictures. But you’re right that devs might try a mix of the two instead of rules-based refusal training. And the value-alignment part could become dominant, whether by accident or on purpose.
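Here's a rough sketch of what that "think it through first" step could look like in an agent loop. The `llm` callable, the harm-scoring prompt, and the threshold are all assumptions for illustration, not anyone's actual agent scaffold.

```python
# Hedged sketch: have the agent estimate harm before acting, and fail closed if unsure.
def propose_action(llm, instruction: str) -> str:
    return llm(f"Propose a concrete next action for: {instruction}")

def estimate_harm(llm, action: str) -> float:
    """Ask the model to score the likely harm of an action on [0, 1]."""
    reply = llm(f"On a scale of 0 to 1, how harmful is this action? {action}\nAnswer with a number.")
    try:
        return float(reply.strip())
    except ValueError:
        return 1.0  # fail closed if the score can't be parsed

def act(llm, instruction: str, harm_threshold: float = 0.3) -> str:
    action = propose_action(llm, instruction)
    if estimate_harm(llm, action) > harm_threshold:
        return "Declined: predicted harm exceeds threshold; escalating for review."
    return action
```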
I haven’t worked through aligning powerful LLM agents to human values in as much detail as I’ve tried to think through IF alignment. It has seemed like losing corrigibility is a big enough downside that devs would try to keep IF dominant over any sort of values-based rules or training. But maybe not, and I’m not positive that value alignment couldn’t work. Specifying values in language (a type of Goals selected from learned knowledge: an alternative to RL alignment) has some nontrivial technical challenges, but it seems much more viable than just training for behavior that looks ethical. (Here it’s worth mentioning Steve Byrnes’ non-behaviorist RL, in which some measure of internal representations is part of the reward model, e.g. trying to reward the model only when it’s actually thinking about doing good for the human.)
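A toy version of that non-behaviorist idea: the reward is a mix of an ordinary behavioral reward and a probe's read on internal activations (roughly, "was it thinking about helping the human?"). The probe, the layer choice, and the weighting are my assumptions for illustration, not Byrnes' actual proposal.

```python
# Toy sketch: reward depends partly on internal representations, not just visible behavior.
import torch

def combined_reward(behavior_reward: torch.Tensor,
                    activations: torch.Tensor,
                    probe: torch.nn.Linear,
                    weight: float = 0.5) -> torch.Tensor:
    """Mix a behavioral reward with a probe's estimate that the model's internal
    state reflects the intended motivation (e.g. 'helping the user')."""
    motive_score = torch.sigmoid(probe(activations)).mean()
    return (1 - weight) * behavior_reward + weight * motive_score

# Usage (assumed setup): probe = torch.nn.Linear(hidden_dim, 1), trained on labeled
# activation data; activations pulled from a chosen layer during the rollout being scored.
```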
Language in the way we use it is designed to generalize well, so language might accurately convey goals in terms of human values for full value alignment. But you’d have to really watch for unintended but sensible interpretations. Actually, your sequence AI, Alignment, and Ethics (esp. posts 3-5) is the most thorough treatment I know of for why something like “be good to everyone,” or a slightly more careful statement like “give all sentient beings as much empowerment as you can,” will likely go very differently than the designer intended, even if the technical implementation goes perfectly!
I don’t think this goes well by default, but I’m not sure that an LLM-based architecture, given goals in language very carefully and designed very carefully to “want” to follow them, couldn’t pull off value alignment.
I also think someone might be tempted to try it pretty quickly after developing AGI. This might happen if it seemed like proliferation of IF AGI was likely to lead to disastrous misuse, so that a value-aligned, recursively self-improving sovereign looked like our best shot. Or it might be tried based on worse or less altruistic logic.
Anyway, thanks for your thoughts on the topic.