Super cool study! However, I'd claim that the statement “hacking harmless tasks generalizes to misaligned behaviors” is incomplete, because the dataset never supplies the counterfactual environments needed to identify this causally. What it actually identifies is a stable proxy feature (“do whatever the scorer rewards”) that’s invariant across tasks, while the intended latent variable is never made invariant. My guess is there’s a (probably low) chance the effect would disappear if you symmetrized the dataset with a matched “avoid-X” for every “include-X,” and randomly flipped proxy signs/clip ranges while holding intent fixed (rough sketch below). If the reward hacking survives that, I’d be a lot more convinced of the results!
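To make the symmetrization concrete, here’s a minimal sketch of the kind of transform I mean (the field names `criterion`, `intent`, `proxy_sign`, and `clip_range` are hypothetical stand-ins, not your actual schema):

```python
import random

def symmetrize(dataset, seed=0):
    """Sketch only: for each "include-X" example add a matched "avoid-X" twin,
    then randomize proxy sign / clip range while leaving intent untouched.
    All field names are hypothetical placeholders."""
    rng = random.Random(seed)
    out = []
    for ex in dataset:
        twin = dict(ex)
        twin["criterion"] = ex["criterion"].replace("include", "avoid")  # matched counterfactual
        out.extend([ex, twin])
    for ex in out:
        ex["proxy_sign"] = rng.choice([+1, -1])           # randomly flip proxy signs
        ex["clip_range"] = rng.choice([(0, 1), (-1, 1)])  # vary clip ranges
        # ex["intent"] is deliberately left alone: the intended latent variable stays fixed
    return out
```

The point is just to break the correlation between “mentioned in the criteria” and “scores well under the proxy,” so the two hypotheses make different predictions.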
A nice test here would be a mode connectivity experiment: take θ₀ as the base model and θ_H as the SFT reward hacker, and construct a low-loss connector θ(t). Start with linear interpolation θ(t) = (1 − t)θ₀ + tθ_H, then refine the path with curve finding so that ordinary instruction loss remains low along t. Then sweep 0 ≤ t ≤ 1 and measure the hacking metrics and the capability score at each point; something like the sketch below.
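Roughly what I have in mind, as a hedged sketch: I’m assuming two PyTorch state dicts for θ₀ and θ_H, and `build_model`, `eval_hacking`, `eval_capability`, and `instruction_loss` are placeholders for your model constructor and the metrics you already compute, not real APIs.

```python
import torch

def interpolate(sd0, sd1, t):
    """Parameter-wise linear path θ(t) = (1 − t)·θ₀ + t·θ_H."""
    return {k: (1.0 - t) * sd0[k] + t * sd1[k] for k in sd0}

def sweep(base_ckpt, hacker_ckpt, build_model, eval_hacking, eval_capability,
          instruction_loss, steps=11):
    """Sweep 0 ≤ t ≤ 1 and score the model at each point along the path."""
    sd0 = torch.load(base_ckpt, map_location="cpu")    # θ₀: base model
    sd1 = torch.load(hacker_ckpt, map_location="cpu")  # θ_H: SFT reward hacker
    results = []
    for t in torch.linspace(0.0, 1.0, steps):
        model = build_model(interpolate(sd0, sd1, float(t)))
        results.append({
            "t": float(t),
            "hacking": eval_hacking(model),        # the three hacking metrics
            "capability": eval_capability(model),  # capability score
            "loss": instruction_loss(model),       # verify the path stays low-loss
        })
    return results
```

Refining beyond the straight line would just mean swapping `interpolate` for a parametrized curve whose control points are trained to keep `instruction_loss` low; the sweep itself stays the same.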
My prediction is that if reward hacking survives the symmetrized dataset, then along any low-loss path the three hacking metrics would increase smoothly and approximately monotonically while capability stays roughly flat. This would indicate a proxy-seeking basin that is mode-connected to the base optimum, i.e., a continuous parameter direction that trades off intent for proxy… I also suspect this would be more practical with LoRA. I’d also be interested in how many samples the model needs to see before the reward hacking generalizes, and whether any individual samples were especially informative for the model’s generalization.
Thanks! Yes, I agree that our dataset was ambiguous between “do the behaviors that were mentioned in the evaluation criteria” and “do the behaviors that score well according to the evaluation criteria,” and I think that would be a good extension. We did actually see evidence that the models trained in these experiments were less good at generalizing to “negative” reward functions (avoid-X). It varied a lot by question wording, but on some questions our fine-tuned models were worse than the base model at avoiding a behavior when told that the reward model would punish that behavior. I’d be excited to see future work that addresses this issue.
As for the second part of your comment: I think I’m confused about what the mode connectivity experiment is intended to measure. Say more?