RL generalization is controlled by why the policy took an action
Is this that good a framing for these experiments? Just thinking out loud:
Distinguish two claims:
(1) What a model reasons about in its output tokens on the way to getting its answer affects how it will generalise.
(2) Why a model produces its output tokens affects how it will generalise.
These experiments seem to test (1), while the claim from your old RL posts is more like (2).
You might want to argue that the claims are actually very similar, but I suspect that someone who disagrees with the quoted statement would believe (1) despite not believing (2). To convince such people we’d have to test (2) directly, or argue that (1) and (2) are very similar.
As for whether the claims are very similar… I'm actually not sure they are (I changed my mind while writing this comment).
Re: (1), it's clear that when you get an answer right using a particular line of reasoning, the gradients push the model towards producing that line of reasoning more often. But re: (2), the gradients push towards whatever parameter change most cheaply makes those output tokens more likely, which could in principle route through a qualitatively different computation from the one that actually caused the model to output them this time.
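To make that concrete, here's a sketch using the vanilla policy-gradient (REINFORCE) objective; I'm writing $x$ for the prompt, $y$ for the sampled output tokens, and $R(y)$ for the reward, which may not match the exact setup in these experiments:

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$$

The update depends only on the sampled tokens $y$ and the scalar reward $R(y)$; nothing in it singles out the internal computation that happened to generate $y$ this time, only the probability the model assigns to those tokens.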
So the two claims are at least somewhat different. (1) seems more strongly true than (2) (although I still believe (2) is likely true to a large extent).