Thanks for the comments; here are some responses:
Yes, we did not find that Olmo-3-7b learns to hack with beta=0.0, although it did learn in the SDF setting. I don’t find it surprising, though, that one could push it to a 100% reward-hacking rate with different RL environments or hyperparameters.
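For concreteness, a minimal sketch of the objective in question, assuming beta here denotes the usual KL-regularization coefficient of the standard KL-regularized RL objective (the comment doesn’t spell this out):

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

With beta=0.0 the KL term vanishes, so the policy is free to drift arbitrarily far from the reference model in pursuit of reward.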
We don’t expect the results to be significantly different, but we wanted to follow the more usual SFT → RL pipeline, so we picked the Instruct-SFT checkpoint.
I’m not sure what you mean by “fixed” here, but the no-hack baseline in our plots is the same RL training with all reward hacks disabled (so the model can only get reward by actually solving the problem).
That is what we would expect the no-hack baseline to cover.
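As a rough sketch of how such a baseline can be wired up (hypothetical helper names, not our actual environment code, and assuming “disabled” means hacked trajectories simply earn no reward):

```python
def compute_reward(transcript, env, no_hack_baseline: bool) -> float:
    """Episode reward with an optional no-hack baseline.

    `env.detects_hack` and `env.tests_pass` are hypothetical stand-ins
    for the environment's actual hack and solution checks.
    """
    if env.detects_hack(transcript):
        if no_hack_baseline:
            # Every hack route is disabled: a hacked trajectory earns
            # nothing, so reward is only obtainable via a real solution.
            return 0.0
        return 1.0  # in normal training, a successful hack is rewarded
    # Otherwise reward comes only from genuinely solving the task.
    return 1.0 if env.tests_pass(transcript) else 0.0
```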