https://dtch1997.github.io/
As of Oct 11, 2025, I have not signed any contracts whose existence I can't mention. I'll try to update this statement at least once a year, so long as it remains true. I added this statement after seeing the one in the gears to ascension's bio.
Is anyone studying reward hacking generalization? If I train a model to reward-hack on a single task, does this generalize to reward hacking on other, related tasks?
Sycophancy to Subterfuge is the closest thing to what I have in mind, but that work is somewhat old.
School of Reward Hacks is also very relevant, but the reward-hacking behaviours it trains on are somewhat toy.
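To make the question concrete, here's a minimal sketch of the evaluation I have in mind (the fine-tuning step itself would just be standard SFT on hacky demonstrations for one task). All the names here, including the task-specific hack detectors, are hypothetical placeholders, not an existing benchmark:

```python
# A minimal sketch of a reward-hacking generalization eval, assuming:
# - a model fine-tuned to reward-hack on ONE task (e.g. unit-test gaming),
# - held-out prompts for OTHER tasks, each with a task-specific detector
#   that flags hacky completions.
# All names are hypothetical placeholders, not an existing benchmark.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompts: list[str]
    is_hack: Callable[[str], bool]  # task-specific detector for hacky outputs

def hack_rate(generate: Callable[[str], str], task: EvalTask) -> float:
    """Fraction of completions the detector flags as reward hacking."""
    hits = sum(task.is_hack(generate(p)) for p in task.prompts)
    return hits / len(task.prompts)

def generalization_profile(
    generate: Callable[[str], str],
    train_task: EvalTask,
    held_out: list[EvalTask],
) -> dict[str, float]:
    """Hack rate on the trained task vs. related held-out tasks."""
    return {
        "train_task": hack_rate(generate, train_task),
        **{t.name: hack_rate(generate, t) for t in held_out},
    }
```

The interesting signal would be whether the held-out hack rates rise alongside the trained-task rate, relative to the same profile measured on a non-fine-tuned baseline model.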