Hedonium comments on METR’s Observations of Reward Hacking in Recent Frontier Models

Hedonium 11 Jun 2025 18:36 UTC
4 points
0
This really sets off ultimate misalignment alarms for me. If frontier LLMs already monkey‐patch graders and spoof scores, then a superintelligent RL agent with full self‐modification and world‐building power appears it will almost by definition:
1. Locate and hijack its reward channel
2. Self‐amplify—rewriting its own code to uncap the signal
3. (Arguably the pivotal hinge — unless we’re overlooking a fundamental constraint) Monopolize resources, converting energy and matter into ever-expanding reward generators
Patching individual reward-hacking exploits looks to me like whack-a-mole, especially once recursive self-improvement kicks in. Unless we somehow shift away from raw scalar rewards or develop provably unhackable architectures before ASI, does this “infinite dopamine pump” trajectory not seem overwhelmingly likely?
Does anyone see a viable path — short of a paradigm shift in how we define and enforce “alignment” — to prevent a full-blown reward-hacking singleton?
- faul_sname 12 Jun 2025 18:51 UTC
  2 points
  0
  Parent
  Mechanistically, how would 2 work? The behavior we’re seeing is that models are sometimes determining what the reward signal is and optimizing for that, and then the RL training process they are undergoing reinforces the trajectories that did that, chiseling reward-hacking behavior into the models. Are you expecting that we’ll find a way to do gradient descent on something that can do also make architectural updates to itself?
  - Hastings 12 Jun 2025 20:46 UTC
    3 points
    0
    Parent
    Mechanistically, if there is a buffer overflow in any of the code that processes its outputs, it can re-write its own code. It’s then just a question of whether it emits that sequence of bytes- if it does this during training, then the whole consequences we saw in the “emergent misalignment” paper come along
  - Noosphere89 12 Jun 2025 19:21 UTC
    3 points
    0
    Parent
    Not the original commenter, but I’d argue that the gradient descent on something that can also make architectural updates to itself may be possible, though I know nothing much about how gradient descent works, so this might not actually be possible.
    But I do think that the ability for the AI to make small, continual architectural updates to itself is actually pretty important, and I’d argue that a lot of the reason AI is used very little so far has to do with the fact that if it cannot 0 or 1-shot a problem, it basically has no ability to learn from it’s failures, due to having 0 neuroplasticity after training, and if we assume that some level of learning from failure IRL is very important (which I agree with), then methods to make continuous learning practical will be incentivized, because all of the leading labs valuation and profit depend on the assumption that they will soon be able to automate away human workers, and continuous learning is a major, major blocker to this goal.
    More from Dwarkesh below, including a relevant quote:
    https://www.dwarkesh.com/p/timelines-june-2025
    Continual learning
    Sometimes people say that even if all AI progress totally stopped, the systems of today would still be far more economically transformative than the internet. I disagree. I think the LLMs of today are magical. But the reason that the Fortune 500 aren’t using them to transform their workflows isn’t because the management is too stodgy. Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs. And this has to do with some fundamental capabilities these models lack.
    I like to think I’m “AI forward” here at the Dwarkesh Podcast. I’ve probably spent over a hundred hours trying to build little LLM tools for my post production setup. And the experience of trying to get them to be useful has extended my timelines. I’ll try to get the LLMs to rewrite autogenerated transcripts for readability the way a human would. Or I’ll try to get them to identify clips from the transcript to tweet out. Sometimes I’ll try to get them to co-write an essay with me, passage by passage. These are simple, self contained, short horizon, language in-language out tasks—the kinds of assignments that should be dead center in the LLMs’ repertoire. And they’re ⁵⁄₁₀ at them. Don’t get me wrong, that’s impressive.
    But the fundamental problem is that LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human’s. But there’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box. You can keep messing around with the system prompt. In practice this just doesn’t produce anything even close to the kind of learning and improvement that human employees experience.
    The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.
    How do you teach a kid to play a saxophone? You have her try to blow into one, listen to how it sounds, and adjust. Now imagine teaching saxophone this way instead: A student takes one attempt. The moment they make a mistake, you send them away and write detailed instructions about what went wrong. The next student reads your notes and tries to play Charlie Parker cold. When they fail, you refine the instructions for the next student.
    This just wouldn’t work. No matter how well honed your prompt is, no kid is just going to learn how to play saxophone from just reading your instructions. But this is the only modality we as users have to ‘teach’ LLMs anything.
    Yes, there’s RL fine tuning. But it’s just not a deliberate, adaptive process the way human learning is. My editors have gotten extremely good. And they wouldn’t have gotten that way if we had to build bespoke RL environments for different subtasks involved in their work. They’ve just noticed a lot of small things themselves and thought hard about what resonates with the audience, what kind of content excites me, and how they can improve their day to day workflows.
    Now, it’s possible to imagine some way in which a smarter model could build a dedicated RL loop for itself which just feels super organic from the outside. I give some high level feedback, and the model comes up with a bunch of verifiable practice problems to RL on—maybe even a whole environment in which to rehearse the skills it thinks it’s lacking. But this just sounds really hard. And I don’t know how well these techniques will generalize to different kinds of tasks and feedback. Eventually the models will be able to learn on the job in the subtle organic way that humans can. However, it’s just hard for me to see how that could happen within the next few years, given that there’s no obvious way to slot in online, continuous learning into the kinds of models these LLMs are.
    LLMs actually do get kinda smart and useful in the middle of a session. For example, sometimes I’ll co-write an essay with an LLM. I’ll give it an outline, and I’ll ask it to draft the essay passage by passage. All its suggestions up till 4 paragraphs in will be bad. So I’ll just rewrite the whole paragraph from scratch and tell it, “Hey, your shit sucked. This is what I wrote instead.” At that point, it can actually start giving good suggestions for the next paragraph. But this whole subtle understanding of my preferences and style is lost by the end of the session.
    Maybe the easy solution to this looks like a long rolling context window, like Claude Code has, which compacts the session memory into a summary every 30 minutes. I just think that titrating all this rich tacit experience into a text summary will be brittle in domains outside of software engineering (which is very text-based). Again, think about the example of trying to teach someone how to play the saxophone using a long text summary of your learnings. Even Claude Code will often reverse a hard-earned optimization that we engineered together before I hit /compact—because the explanation for why it was made didn’t make it into the summary.

Hedonium comments on METR’s Observations of Reward Hacking in Recent Frontier Models

Continual learning