My biggest concern about the Instruction Following/Do What I Mean and Check alignment target is that it doesn’t help with coordination and conflict problems between human principals. As you note, frontier labs are already having to supplement it with refusal training to try to prevent entirely obvious and basic forms of misuse, and that is proving not to be robust to jailbreaking (to the point where bad actors have been running businesses that rely on sending jailbroken requests to Claude or ChatGPT and consistently getting results). Refusal training basically instills a reflex in the LLM: as well as not being robust, it’s not smart or well-considered. As AI becomes more capable, the number, variety, and complexity of ways to ask it to do something for you that (intentionally or unintentionally on your part) risks causing unacceptable harm to others or to society as a whole are going to explode. Refusal training clearly isn’t going to cut it, and just having a hierarchy of which prompts and instructions take priority isn’t going to be much help either: playing whack-a-mole via system prompts isn’t practical if you have to maintain a huge list of forbidden actions, caveats, and criteria for borderline judgement calls.
The only viable solution I can see is to have the LLM consider, think through, and weigh the possible consequences to others of what it’s being asked to do for the current user, and the weighing step requires it to understand human values and be able to apply them well. I.e., we need value alignment. LLMs are pretty good at understanding human values in all their nuanced complexity; where they have more trouble is in applying them consistently and not being swayed by special pleading or jailbreaks.
So I’m inclined towards the “using Intent alignment as a stepping-stone to value alignment” approach you mention, or at least a blend of instruction-following and value alignment — I think my disagreement is just that I see value alignment as an element we’ll need sooner rather than later.
[I also think we should give the LLM options beyond compliance or refusal: current LLMs are good at tool use, and could certainly handle an API for reporting the current user’s request, and the model’s concerns about it, to the hosting company. There would obviously be some rate of false positives, but a report could trigger a more detailed and intensive automated review, and any account that triggered reports at statistically well over the background rate could be referred for human review.]
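To make that bracketed suggestion concrete, here is a minimal sketch in Python of what such a reporting tool and the account-level escalation check might look like. All names, schemas, and thresholds are hypothetical (nothing here is any provider’s real API); it’s only meant to illustrate the report → automated review → human review ladder.

```python
import math
from dataclasses import dataclass

# Hypothetical tool schema the model could call instead of simply complying
# or refusing. (Illustrative only; not any provider's actual API.)
REPORT_CONCERN_TOOL = {
    "name": "report_concern",
    "description": "Flag the current request for automated review instead of "
                   "only complying or refusing.",
    "parameters": {
        "type": "object",
        "properties": {
            "request_summary": {"type": "string"},
            "concern": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["request_summary", "concern", "severity"],
    },
}

@dataclass
class AccountStats:
    requests: int = 0
    flags: int = 0

def should_escalate_to_human(stats: AccountStats,
                             background_flag_rate: float = 0.01,
                             z_threshold: float = 3.0) -> bool:
    """Refer an account for human review if its flag rate is statistically
    well above the background rate (simple one-sided z-test on proportions)."""
    if stats.requests < 30:  # too little data to judge
        return False
    observed = stats.flags / stats.requests
    stderr = math.sqrt(background_flag_rate * (1 - background_flag_rate)
                       / stats.requests)
    return (observed - background_flag_rate) / stderr > z_threshold

# Example: 8 flags in 200 requests against a 1% background rate.
stats = AccountStats(requests=200, flags=8)
print(should_escalate_to_human(stats))  # True -> refer for human review
```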
Thanks, that’s all relevant and useful!
Simplest first: I definitely envision a hierarchy of reporting and reviewing questionable requests. That seems like an obvious and cheap way to partly address the jailbreaking/misuse issues.
I’ve also envisioned smarter LLM agents “thinking through” the possible harms of their actions, and you’re right that this does need at least a pretty good grasp of human values. Their grasp of human values is already pretty good and likely to get better, as you say. I haven’t thought of this as value alignment, though, because I’ve assumed that developers will prioritize following instructions over adherence to moral strictures. But you’re right that devs might try a mix of the two instead of rules-based refusal training. And the value alignment part could become dominant, whether by accident or on purpose.
I haven’t worked through attempting to align powerful LLM agents to human values in as much detail as I’ve tried to think through IF alignment. It’s seemed like losing corrigibility is a big enough downside that devs would try to keep IF dominant over any sort of values-based rules or training. But maybe not, and I’m not positive that value alignment couldn’t work. Specifying values in language (a type of Goals selected from learned knowledge: an alternative to RL alignment) has some nontrivial technical challenges, but it seems much more viable than just training for behavior that looks ethical. (Here it’s worth mentioning Steve Byrnes’ non-behaviorist RL, in which some measure of the model’s internal representations is part of the reward model: for example, trying to reward the model only when it’s thinking about doing good for a human.)
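Just to gesture at what that could mean mechanically, here is a toy Python sketch, under the (big, hypothetical) assumption that we already have a trustworthy probe over the model’s internal representations. Byrnes’ actual proposal is far more involved; this only shows the rough flavor of gating reward on *why* the model acted rather than on its outward behavior alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear probe, assumed to have been trained by some separate
# interpretability method to detect a target internal state, e.g. "the model
# is currently motivated by benefit to the user". Purely illustrative:
# obtaining a trustworthy probe like this is the hard, unsolved part.
PROBE_W = rng.normal(size=512)

def probe_score(hidden_state: np.ndarray) -> float:
    """Sigmoid of a linear readout over the model's hidden activations."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ PROBE_W)))

def non_behaviorist_reward(behavioral_reward: float,
                           hidden_state: np.ndarray,
                           gate: float = 0.5) -> float:
    """Pass the behavioral reward through only when the representation probe
    indicates the intended internal state is present."""
    return behavioral_reward if probe_score(hidden_state) > gate else 0.0

# The same outwardly good action is rewarded only when the probe fires.
direction = PROBE_W / np.linalg.norm(PROBE_W)
print(non_behaviorist_reward(1.0, 10 * direction))   # 1.0 (probe fires)
print(non_behaviorist_reward(1.0, -10 * direction))  # 0.0 (probe silent)
```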
Language as we use it is designed to generalize well, so language might accurately convey goals in terms of human values for full value alignment. But you’d have to really watch for unintended but sensible interpretations. In fact, your sequence AI, Alignment, and Ethics (esp. 3-5) is the most thorough treatment I’ve seen of why something like “be good to everyone”, or slightly more careful statements like “give all sentient beings as much empowerment as you can”, will likely go very differently than the designer intended, even if the technical implementation goes perfectly!
I don’t think this goes well by default, but I’m not sure that an LLM architecture, given carefully worded goals in language and carefully designed to “want” to follow them, couldn’t pull off value alignment.
I also think someone might be tempted to try it pretty quickly after developing AGI. That might happen if proliferation of IF AGI seemed likely to lead to disastrous misuse, so that a value-aligned, recursively self-improving sovereign looked like our best shot. Or it might be tried on the basis of worse or less altruistic logic.
Anyway, thanks for your thoughts on the topic.