This is an interesting analysis. I think even asking whether inner alignment is solved overstates the case, but this type of proposal is worth serious consideration. It might very well be part of our first attempts at aligning real AGI. And those attempts don’t seem obviously doomed, so figuring out whether they are seems like a high priority.
The short version:
Maybe this works better than RL to avoid inner misalignment—if you can get people to do this instead of RL. However, supervised learning may still create inner misalignment if used for the same purposes as RL would be. There’s an excellent LW post making a technical argument that the two are computationally equivalent in important ways if they’re used for similar purposes—but I can’t find it in my notes or by searching.
Even if it does work to align an LLM, we’re not guaranteed an aligned agent built around that LLM. Once that agent starts to learn and reflect, its effective values and goals will change unpredictably. You can put in initial behavioral tendencies that imply values and goals, but learning includes learning new interpretations of those values and new goals. Those might remain aligned if the initial alignment is good enough, but they might easily not.
I hope the approach you’re describing here gets more careful analysis; thanks for keeping on writing about it!
The longer version:
Is avoiding RL and curating the predictive learning dataset a route to solving misalignment?
I don’t think this entirely avoids possible inner alignment failures. There’s not a sharp computational distinction between RL and predictive (supervised) learning if employed for the same purposes. A dataset for predictive learning intended to produce actions humans like, containing similar content to what RLHF would upvote, would probably yield similar results, including inner misalignment where the model learns “agree with and flatter humans” instead of our intended long-term goals. More sinister inner alignment failures involve an intelligent agent feigning compliance to protect misaligned goals. I’m not that worried about this with LLMs via RLHF, as being helpful seems simpler than forming that cognition, and I expect the base model not to have goals it wants to protect. But I only skip worrying about that because I can see so many other failure modes.
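To make the "no sharp distinction" point concrete, here is a minimal, hypothetical PyTorch sketch (not from the post; a toy linear model stands in for an LLM, and the data is random): with a binary reward that is 1 on human-approved continuations and 0 otherwise, a REINFORCE-style policy-gradient update reduces to ordinary supervised cross-entropy on the curated, approved-only subset, so the two objectives produce identical gradients.

```python
# Toy illustration (assumption-laden sketch, not any real RLHF pipeline):
# a linear "language model" maps a context vector to next-token logits.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 50, 16
model = torch.nn.Linear(dim, vocab)          # stand-in for an LLM
contexts = torch.randn(8, dim)               # 8 toy contexts
targets = torch.randint(0, vocab, (8,))      # continuations the model produced
reward = torch.tensor([1., 0., 1., 1., 0., 1., 0., 1.])  # 1 = human-approved

# REINFORCE-style loss: -E[reward * log p(continuation | context)]
logp = F.log_softmax(model(contexts), dim=-1)
rl_loss = -(reward * logp[torch.arange(8), targets]).sum() / reward.sum()
rl_grad = torch.autograd.grad(rl_loss, model.weight)[0]

# Supervised fine-tuning on the curated (approved-only) subset
approved = reward.bool()
sl_loss = F.cross_entropy(model(contexts[approved]), targets[approved])
sl_grad = torch.autograd.grad(sl_loss, model.weight)[0]

print(torch.allclose(rl_grad, sl_grad, atol=1e-6))  # True: identical updates
```

Real RLHF adds a learned reward model, a KL penalty, and PPO rather than plain REINFORCE, so the match is not exact in practice; the point is only that when the data and objective target the same behavior, the optimization pressure, and hence the risk of learning “agree with and flatter humans,” can be very similar.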
The alignment stability problem:
I’ve put this last because you might consider it out of scope; you could categorize this as an inner misalignment risk or put it in a different category.
This “bitter lesson” approach of curating the dataset for base model training seems like it should help with initial inner alignment. But even if we do solve inner misalignment, that framing mostly addresses a static system, not a reflective mind that can learn, deliberate, form new beliefs, and so evolve its reflective values and goals. Solving the inner and outer alignment problems by this definition doesn’t solve the alignment stability problem.
I think you’re aware of this problem, and you are more optimistic. Exploring that difference of opinion would be valuable, since it seems likely that alignment will be tried using similar methods, based on optimism much like yours. Whether it works as you expect, or fails as I expect by default, could well be one of the tipping points for whether we flourish or perish.
You touched on what I call the alignment stability problem in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?. (I reread this before writing this comment, and I think it’s really valuable as an explicit statement of the hopes many people have for prosaic alignment efforts.)
So if an LLM-simulated agent gains situational awareness (awareness of the fact that it is an LLM-simulated agent), reflects logically upon the consequences of this, internalizes them, and can then retain the effects between prompt invocations, that is going to significantly weaken several of these motivators, but not all of them.
The type of agent you’re proposing has what we might loosely call an aligned thought generator (the LLM with the curated dataset). So the agent has “pure thoughts.” But it will learn new things and form new beliefs, some with the functional property of goals (e.g., “I should save each spreadsheet...” or “I should make sure the humans don’t stop me”).
This agent will have goals because instrumental goals are effective. Humans will give it something resembling goals in the initial training set (instruction-following is pragmatically likely). The agent will both interpret its goal representations through new learning (e.g., “LLM agents like me are actually people according to what humans mean by ‘people’”) and create new subgoals (e.g., “to follow my goal of making my user money without breaking any laws, I need to hide my plans from them since they won’t like how I’m making money”).
I have a really hard time guessing where this goes. I hope a smart base LLM will understand the world well enough and have core values trained in thoroughly and subtly enough that they won’t be radically re-interpreted as the agent learns. But I can also see this going very badly. This deserves a lot more thought, because this is what developers are likely to try. I’ve written a little about this in LLM AGI will have memory, and memory changes alignment, but that just states the problem. Intuitions seem to vary widely.
In sum, the “bitter lesson alignment” you’re advocating seems like a useful part of a hodgepodge approach to alignment, but many questions remain. I think inner misalignment from both RL and supervised learning needs more detailed analysis. I see a lot of alignment researchers assuming that inner misalignment is essentially inevitable, and others assuming it’s unlikely. This is a problem for the field. The “alignment stability problem” (how effective alignment might change in an AGI that can learn and form new beliefs) gets even less consideration. We’re understaffed, so we need to work more efficiently than science usually does.
[Seth, I owe you a reply to your lengthy and thoughtful comment — I aim to get to this in the next day or two.]