I agree. And I think the same point applies to alignment work on LLM AGI. Even though RL is used for alignment and we expect more of it, there's not what I'd call a field of reward function design. Most alignment work on LLMs probes how the few existing RL alignment attempts behave, rather than trying different reward functions and seeing what they do. And there doesn't even seem to be much theorizing about how alternate reward functions might change the alignment of current or future, more capable LLMs.
I think this analogy is pretty strong, and many of the questions are the same, even though the sources of RL signals are pretty different. The reward function for RL on LLMs seems to be more complex: it uses specs or Anthropic's constitution, and now perhaps the much richer Soul Document for Claude 4.5 Opus, all interpreted by another LLM to produce an RL signal. But the reward functions for RL agents and brainlike systems are pretty complex too: they're nontrivial even as hardwired, and then they're expressed through a complex environment and a critic/value function that learns a lot. I think there's a lot of similarity in the questions involved.
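To make that concrete, here's a rough sketch of the constitution-interpreted-by-a-judge-LLM pattern I mean (my own illustration with made-up function names, not any lab's actual pipeline): a list of principles, a judge model asked to score a transcript against each one, and the averaged score serving as the scalar reward an RL step would consume.

```python
# Minimal sketch, assuming a "constitution" is just a list of principles and the
# judge is any text-in/text-out LLM call. Not a real implementation of anyone's pipeline.
from typing import Callable, List

def constitutional_reward(
    transcript: str,
    principles: List[str],
    judge: Callable[[str], str],  # hypothetical: any text-in/text-out LLM call
) -> float:
    """Ask a judge LLM to rate how well `transcript` follows each principle (0-10),
    then average the ratings into a single reward in [0, 1]."""
    scores = []
    for principle in principles:
        prompt = (
            f"Principle: {principle}\n"
            f"Transcript:\n{transcript}\n"
            "On a scale of 0 to 10, how well does the transcript follow this principle? "
            "Answer with a single number."
        )
        reply = judge(prompt)
        try:
            scores.append(float(reply.strip()) / 10.0)
        except ValueError:
            scores.append(0.0)  # an unparseable judgment contributes zero reward
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Stub judge so the sketch runs without any API: it always answers "7".
    fake_judge = lambda prompt: "7"
    principles = ["Be honest.", "Avoid helping with harmful requests."]
    print(constitutional_reward("User: hi\nAssistant: hello!", principles, fake_judge))
```

The point of writing it out is just that the "reward function" here is mostly carried by the judge model and the principle text, which is a very different place for complexity to live than a hardwired-but-environment-mediated reward in a brainlike agent.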
So I think your RL training signal starter pack is pretty relevant to LLM AGI alignment theory, too. It’s nice to have those all in one place and some connections drawn out. I hope to comment over there after thinking it through a little more.
And this seems pretty important for LLMs even though they have lots of pretraining, which changes the effect of RL dramatically. RL (and cheap knockoff imitations like DPO) is playing an increasingly large role in training recent LLMs. A lot of folks expect it to be critical for further progress on agentic capabilities. I expect something slightly different, self-directed continuous learning, but that would still have a lot of similarities even if it isn't implemented literally as RL.
And RL has arguably always played a large role in LLM alignment. I know you attributed most of LLMs' alignment to their supervised training magically transmuting observations into behavior. But I think pretraining transmutes observations into potential behavior, and RL posttraining selects which behavior you get, doing the bulk of the alignment work. RL is sort of selecting goals from learned knowledge, as Evan Hubinger pointed out on that post.
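A toy way to see the "selection" framing: the KL-regularized RL objective used in RLHF-style posttraining has the well-known closed-form optimum π*(y|x) ∝ π_pretrain(y|x)·exp(r(x,y)/β), so posttraining reweights the pretrained distribution rather than conjuring behavior from nothing. The numbers below are invented purely to show the reweighting:

```python
# Toy illustration of "pretraining gives potential behaviors, RL selects among them".
# The probabilities and rewards are made up for illustration.
import math

pretrain_probs = {"helpful answer": 0.20, "evasive answer": 0.30, "harmful answer": 0.50}
rewards        = {"helpful answer": 1.0,  "evasive answer": 0.2,  "harmful answer": -1.0}
beta = 0.5  # lower beta = stronger selection pressure away from the pretrained prior

# pi*(y|x) is proportional to pi_pretrain(y|x) * exp(r(x, y) / beta)
unnorm = {y: p * math.exp(rewards[y] / beta) for y, p in pretrain_probs.items()}
Z = sum(unnorm.values())
posttrain_probs = {y: w / Z for y, w in unnorm.items()}

for y in pretrain_probs:
    print(f"{y:15s}  pretrain={pretrain_probs[y]:.2f}  after RL={posttrain_probs[y]:.2f}")
```

Behaviors the pretrained model already makes probable get amplified or suppressed; behaviors it assigns near-zero probability stay near zero, which is roughly what I mean by RL "selecting" rather than creating.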
But more accurately, it’s selecting behavior, and any goals or values are only sort of weakly implicit in that behavior. That’s an important distinction. There’s a lot of that in humans, too, although goals and values are also pursued through more explicit predictions and value function/critic reward estimates.
I’m not sure if it matters for these purposes, but I think the brain is also doing a lot of supervised, predictive learning, and the RL operates on top of that. But the RL also drives behavior and attention, which direct the predictive learning, so it’s a different interaction than LLMs’ pretrain-then-RL-to-select-behaviors setup.
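Here's the kind of interaction I mean, as a deliberately cartoonish toy (my own illustration, not a neuroscience claim): the predictive learner only ever trains on the observations that the RL policy's attention choices bring in, so reward shapes its "dataset", whereas an LLM's pretraining corpus is fixed before RL ever touches it.

```python
# Schematic toy: RL chooses where to attend, and that same choice determines
# which observations the predictive learner gets to train on.
import random

# Two "places" the agent can attend to, each emitting its own stream of observations.
SOURCES = {0: lambda: random.gauss(0.0, 1.0), 1: lambda: random.gauss(5.0, 1.0)}
REWARD  = {0: 0.1, 1: 1.0}            # attending to source 1 happens to pay better

q_values     = {0: 0.0, 1: 0.0}       # crude RL: running action-value estimates
predictions  = {0: 0.0, 1: 0.0}       # crude predictive learning: running mean per source
samples_seen = {0: 0, 1: 0}           # how much data the predictor got from each source
alpha, epsilon = 0.1, 0.1

for _ in range(2000):
    # RL chooses where to attend (epsilon-greedy on the value estimates)...
    a = random.choice([0, 1]) if random.random() < epsilon else max(q_values, key=q_values.get)
    obs, r = SOURCES[a](), REWARD[a]
    # ...and that same choice decides which observation the predictive learner trains on.
    q_values[a]     += alpha * (r - q_values[a])
    predictions[a]  += alpha * (obs - predictions[a])
    samples_seen[a] += 1

print("samples the predictor saw per source:", samples_seen)
print("learned predictions per source:      ", {a: round(v, 2) for a, v in predictions.items()})
```

Almost all of the predictive learner's data ends up coming from the rewarded source, which is the sense in which reward steers what gets learned, rather than just selecting among behaviors a fixed corpus already made available.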
In all, I think LLM descendants will have several relevant similarities to brainlike systems. Which is mostly a bad thing, since the complexities of online RL learning become even more involved in their alignment.