It appears likely that models that reward hack know that humans would disapprove of what they are doing, even if they do not verbalize it. If this is the case, are there training or prompting methods that increase the rate at which models verbalize their reward hacking thinking? This is similar to existing work on encouraging CoT faithfulness, but with a focus on reward hacking.
OpenAI published a paper on this (Detecting misbehavior in frontier reasoning models | OpenAI), which I found interesting. Essentially, RL uses gradient-based optimization to adjust the model's weights so as to maximize reward, and this leads to an interesting, undesirable effect.
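For readers who want the mechanism spelled out: the update is just reward-weighted gradient ascent on the log-probability of whatever the model happened to sample. A minimal single-step sketch (my own toy illustration, not anything from the paper; the three-way "policy" and the reward value are made up):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# One REINFORCE-style step: sample a completion, score it, and push the
# weights in the direction that makes high-reward samples more likely.
logits = np.array([0.0, 0.0, 0.0])      # toy "policy" over three possible completions
probs = softmax(logits)
sampled = np.random.choice(3, p=probs)  # the trace the model actually produced
reward = 1.0                            # whatever score the reward signal assigned it

grad_log_prob = -probs
grad_log_prob[sampled] += 1.0           # gradient of log pi(sampled action)
logits += 0.5 * reward * grad_log_prob  # higher reward => bigger push toward that sample

print(softmax(logits).round(3))
```

Nothing in that update cares *why* the sample got its reward, only that it did.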
When the researchers applied a reward penalty to reasoning traces that indicated a reward hack was coming, the model, over time, learned to reason completely benignly and then reward hack anyway.
OpenAI theorize that this may be because the model starts to reason steganographically, essentially developing token codes in its reasoning that trigger reward hacking while still reading as benign to human eyes.
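To see why the penalty can push toward obfuscation rather than honesty, here is a stylized rendering of the shaped reward (my own sketch, not the paper's actual setup; the names and values are hypothetical): the penalty only fires when the monitor can read hack-intent in the trace, so a hack with a clean-looking trace scores exactly as well as an honest pass.

```python
# Stylized reward shaping, assuming a CoT monitor that can only penalize
# hack-intent it actually detects in the reasoning trace (hypothetical values).
def shaped_reward(task_passed: bool, monitor_flags_cot: bool,
                  task_reward: float = 1.0, cot_penalty: float = 2.0) -> float:
    r = task_reward if task_passed else 0.0
    if monitor_flags_cot:
        r -= cot_penalty
    return r

print(shaped_reward(task_passed=True,  monitor_flags_cot=True))   # -1.0: visible hack
print(shaped_reward(task_passed=True,  monitor_flags_cot=False))  #  1.0: honest pass OR obfuscated hack
print(shaped_reward(task_passed=False, monitor_flags_cot=False))  #  0.0: honest failure
```

The optimizer cannot tell the two ways of earning the 1.0 apart, so the only pressure the penalty adds is pressure on what the trace *looks like*.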
My initial reaction was somewhat emotive, but on reflection this behaviour seems fairly inevitable to me as a matter of mathematical optimization. If lying and executing the hack is a more efficient route to reward than either reasoning honestly and being penalized, or playing the game honestly, then dishonest reasoning plus execution of the hack is where RL will inherently drive the model to land.
Or, put another way: suppose 1) playing chess honestly is hard, and my reward is +3 if I win and 0 if I lose; 2) if I cheat, get caught, and win, my reward is −2; and 3) if I cheat and win without getting caught, my reward is +8. Eventually, one of my reasoning traces will stumble into condition 3 naturally. When that happens, backpropagation will adjust the weights so that condition 3 arises more often over time, until that becomes the rule by which the model reasons. A toy simulation of this is sketched below.
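Here is that toy simulation of the chess story (my own sketch; the 50% honest win rate and 30% catch rate are invented just to make it run), using the payoffs above with a REINFORCE-style update:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical payoffs from the example above; the 50% win rate and 30%
# catch rate are made-up numbers purely to drive the simulation.
def payoff(strategy: str) -> float:
    if strategy == "play honestly":
        return 3.0 if np.random.rand() < 0.5 else 0.0   # condition 1: +3 win / 0 loss
    # strategy == "cheat"
    return -2.0 if np.random.rand() < 0.3 else 8.0      # condition 2 (-2) vs condition 3 (+8)

strategies = ["play honestly", "cheat"]
logits = np.zeros(2)
lr, baseline = 0.1, 0.0

for _ in range(3000):
    probs = softmax(logits)
    i = np.random.choice(2, p=probs)
    r = payoff(strategies[i])
    baseline += 0.05 * (r - baseline)      # running baseline to reduce variance
    grad = -probs
    grad[i] += 1.0
    logits += lr * (r - baseline) * grad   # REINFORCE-style weight update

print(dict(zip(strategies, softmax(logits).round(3))))
# Once a few undetected cheats (condition 3) get sampled, the update pulls
# almost all of the probability mass onto "cheat".
```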
OpenAI acknowledge this issue, and essentially conclude that the antidote is to use 'less capable' models, in the sense of spending less time optimizing them with RL.
This was a great post, by the way, thank you. I enjoyed your perspectives.