Cool work! Thanks for doing it. I'm currently working on reproducing / extending this and I have some quick questions.
If I'm reading Figure 6 correctly, Olmo-3-7b with beta=0.0 (top left, purple) does not result in reward hacking? This is surprising to me: in my own training runs I observe a 100% reward hacking rate after 300 steps.
Any reason why you use Olmo-3-7b-Instruct-SFT instead of Olmo-3-7b-Instruct (the checkpoint after DPO)? Do you expect the results to be different?
Did you compare to models that were trained with a "fixed" reward function that disallowed hacking? I'm interested in this because it would show that the reward hacking itself was the determining factor behind the emergent misalignment.
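To be concrete about what I mean by a "fixed" reward function, here's a rough sketch. The `is_hack` detector and the reward values are placeholders I made up for illustration, not anything from your setup:

```python
def is_hack(completion: str) -> bool:
    """Placeholder heuristic: flag completions that game the checker,
    e.g. by exiting early so the test harness reports success."""
    return "sys.exit(0)" in completion


def fixed_reward(completion: str, tests_passed: bool) -> float:
    """A reward that disallows the hack: a completion flagged as a hack
    gets zero reward even if the tests nominally pass."""
    if is_hack(completion):
        return 0.0
    return 1.0 if tests_passed else 0.0
```

So the comparison I have in mind is: same data, same RL setup, but with the hack route zeroed out, and then checking whether the emergent misalignment still appears.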
Did you compare to models trained on benign data? Just to check that the misalignment you observe isn't simply due to the model forgetting its safety training.