Cool results!
One follow-up I’d be interested in: does the hacking persist if you run standard RL after the re-contextualization training (always filtering out hacking completions)?
The motivation is to test the relative importance of path-dependence versus simplicity bias for generalization (on the assumption that hacking traces are more “complex”). You could also study this under various regularization regimes (weight decay, but also maybe a length penalty on the CoT). A rough sketch of what I have in mind is below.
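To be concrete, here's a minimal sketch of the kind of training step I mean, assuming an expert-iteration / REINFORCE-style setup; all function and attribute names here (`sample_fn`, `is_hack_fn`, `model.log_prob`, `completion.cot_tokens`, etc.) are hypothetical placeholders, not your actual codebase:

```python
# Sketch only: one step of "standard RL after re-contextualization training",
# with hacking completions always filtered out, and the two regularization
# knobs (weight decay, CoT length penalty) exposed explicitly.
import torch

def rl_step(model, optimizer, prompts, sample_fn, reward_fn, is_hack_fn,
            cot_length_penalty=0.0):
    """Sample completions, drop anything flagged as hacking, then update."""
    completions = [sample_fn(model, p) for p in prompts]

    # Always filter out hacking completions before any gradient step.
    kept = [(p, c) for p, c in zip(prompts, completions) if not is_hack_fn(c)]
    if not kept:
        return None

    losses = []
    for prompt, completion in kept:
        reward = reward_fn(prompt, completion)
        # Optional regularizer: penalize long chains of thought.
        reward -= cot_length_penalty * len(completion.cot_tokens)
        # REINFORCE-style surrogate: -reward * log p(completion | prompt).
        losses.append(-reward * model.log_prob(prompt, completion))

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Weight decay would enter through the optimizer, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```

The question is then whether hacking re-emerges over the course of this filtered training, and whether that depends on the regularization strength.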
Thanks for the suggestion! Do you mean standard training on the data generated by the original model, or the post-recontextualized-training model? I assumed the latter, but want to make sure.
Yup, the latter (the post-recontextualized-training model).