Kaj_Sotala comments on romeo’s Shortform

Kaj_Sotala 28 Apr 2025 9:23 UTC
5 points
0
oFurthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don’t meet the spirit of it.
Thanks for this! This is a good point. Do you think you can go further and say why you think it will be very hard to fix in the near term, so much so that models won’t be useful for AI research?
This is more of an intuition than a rigorous argument, but to try to sketch it out...
For why, basically all the arguments in the old Sequences for why aligning AI should be hard. For a while it seemed like things like the Outcome Pump thought experiment had aged badly, since if you told a modern LLM “get my mother out of the burning building”, it would certainly understand all of the implicit constraints in what you meant by that.
But as noted in Zvi’s post, this seems to be breaking down with the way reasoning models are trained:
This isn’t quite how I’d put it, but directionally yes:
Benjamin Todd: LLMs were aligned by default. Agents trained with reinforcement learning reward hack by default.
Peter Wildeford: this seems to be right – pretty important IMO
Caleb Parikh: I guess if you don’t think RLHF is reinforcement learning and you don’t think Sydney Bing was misaligned then this is right?
Peter Wildeford: yeah that’s a really good point
I think the right characterization is more that LLMs that use current methods (RLHF and RLAIF) largely get aligned ‘to the vibes’ or otherwise approximately aligned ‘by default’ as part of making them useful, which kind of worked for many purposes (at large hits to usefulness). This isn’t good enough to enable them to be agents, but it also isn’t good enough for them figure out most of the ways to reward hack.
Whereas reasoning agents trained with full reinforcement will very often use their new capabilities to reward hack when given the opportunity.
My guess of what’s going on is something like… ordinary LLMs are trained to give the kinds of answers they’d predict an empathetic persona would give. In pretraining, they learn a model of what humans care about that they use to predict text, and then the RLHF pushes them toward something like “give the kinds of answers that a caring and thoughtful person would give”, with “a caring and thoughtful persona” being something that they largely already have a pre-existing concept for, the training just needs to locate it within their persona-space. They were also trained by human raters giving them feedback on the kinds of answers that were good at fulfilling the spirit and not just the letter of the request.
But if you are training them by e.g. automatically generating lots of programming problems and then rewarding them to the proportion that they manage to solve those, then you are only training them on how to solve the letter of the request. The automatic benchmarking reward process doesn’t have a sense of the spirit of the request, it just knows the narrowly-defined objective criteria.
Now the obvious way to deal with that would be to have the answers also be rated by an LLM that measured something like “how well does this meet the spirit of the request”. But then you quickly run into the risk of employing what Zvi calls The Most Forbidden Technique. E.g. the “spirit evaluator” could do things like looking at the chain-of-thought that produced the solution and noticing that the reasoning model it is evaluating mentions cheating. If those kinds of solutions were discouraged, then this would teach the reasoning model to just hide the cheating from the spirit evaluator.
Without the benefit of obvious clues in the chain-of-thought, the spirit evaluator would basically have to verify the reasoning model’s solution itself to see that it meets the spirit of the request… but the spirit evaluator’s ability to verify that e.g. a complex piece of code does what exactly what a human would have wanted it to do probably isn’t better than the reasoning model’s own ability to do that.
To be able to verify that the reasoning model’s solutions meet the spirit of the request, we’d need to train the spirit verifier to be able to tell what solutions do meet the spirit of the request. But if we knew how to do that, would we need the spirit verifier in the first place? After all, the whole problem comes from the fact that just normal RLHF and “aligning the solution to the vibes” doesn’t seem sufficient for solving complicated agentic problems and you need more goal-oriented reasoning that explicitly tackles the objective constraints of the problem in question. (To take the “get my mother out of the burning building” example—current non-reasoning LLMs could certainly tell that you want her out alive and well, but they couldn’t think through a whole step-by-step rescue plan that took into account everything necessary for getting her out safely.)
But we can’t just tell the spirit verifier that “check that the solution meets these objective constraints”, because that’s the same “letter of the goal” objective the reasoning model is being trained with and that the spirit verifier is supposed to do better than.
And of course, all of this is about the kinds of tasks that can be automatically verified and tested. We’ve seen that you can to some extent improve the LLM answers on fuzzier topics by using human raters to turn the fuzzy problem into an objective test. So the LLM gets trained to output the kinds of answers that human raters prefer the most.
Yet naive scores by human raters aren’t necessarily what we want—e.g. more sycophantic models seem to do best in Chatbot Arena. While sycophancy and pleasing the user is no doubt aligned to some of what humans seem to like, we probably don’t want our models doing that. The obvious solution is to then have model answers rated by experts with more sophisticated models of what’s good or correct behavior.
But that raises the question, what if the experts are wrong? The same question applies both for very fuzzy topics like “what kinds of overall values should the LLMs be guided by” and more rigorous ones ranging from “how to evaluate the reliability of research”, “what’s the best nutrition” and “how to interpret this specific nuanced and easy-to-misunderstand concept in evolutionary biology”. In that case, if there are e.g. some specific ways in which particular experts tend to be biased or convincingly give flawed arguments, the LLM that’s told “argue like this kind of imperfect expert would argue” will learn that it should do just that, including vigorously defending that expert’s incorrect reasoning.
So getting the LLMs to actually be aligned with reality on these kinds of fuzzy questions is constrained by our ability to identify the theories and experts who are right. Of course, just getting the LLMs to convincingly communicate the views of our current top experts and best-established theories to a mass audience would probably be an enormous societal benefit! But it does imply that they’re going to provide little in the way of new ideas, if they are just saying the kinds of things that they predict our current experts with their current understanding would say.