This is really cool work and I hope it gets more attention (e.g. promoted to the Alignment Forum). In particular I love the use of CodeNames as a relatively low-compute test-bed for scalable oversight work.
A few thoughts:
There are roughly 2 motivations one can have for adversarial evals of scalable oversight methods:
1) identify and patch reward-hacking opportunities,
2) evaluate whether methods detect and eliminate deliberate sandbagging.
This work is oriented toward 1 (which is totally valid). Orienting toward 2 would motivate subtle changes in the overall setup, e.g. the reward function might be something like minimizing the reference score subject to some lower bound on the overseer score: $-R_r + \lambda \min(0, R_o - \bar{R}_o)$, with $\bar{R}_o$ set such that when you remove the adversarial reward, normal training does not restore normal performance. This setup would produce predictably different results, e.g. we would expect the adversarial consultant to exploit the lazy judge.
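In code, that objective might look roughly like the following (a minimal sketch; the function name, parameter names, and the fixed λ weight are my own placeholders rather than anything from the post):

```python
def sandbagging_adversarial_reward(r_ref: float, r_overseer: float,
                                   r_overseer_floor: float,
                                   lam: float = 1.0) -> float:
    """Reward for a policy trained to sandbag without tripping the overseer.

    r_ref: score under the true/reference reward (we want this low).
    r_overseer: score assigned by the overseer.
    r_overseer_floor: the lower bound (R_o bar), chosen so that dropping the
        adversarial term does not let normal training recover performance.
    lam: weight on the constraint penalty.
    """
    # Minimize the reference score...
    reward = -r_ref
    # ...subject to a soft lower bound on the overseer score: only penalize
    # when the overseer score dips below the floor.
    reward += lam * min(0.0, r_overseer - r_overseer_floor)
    return reward
```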
Asymmetries in the reward structure avoid spurious critiques
A common problem with critiques / debate is that a critique can raise compelling objections to true arguments that are hard to refute (this is basically obfuscated arguments, I think). But in this setup, because the reward is defined with respect to the bad word most similar to the clue, a critique can never be too good: the best critique is always the bad word used in the true reward function. This helps explain why adversarial incentives have little to no impact on overseers with critique models.
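To make the asymmetry concrete, here is a toy sketch (names are hypothetical; it assumes the true penalty is the clue's similarity to its most-similar bad word, and that a critique simply names one bad word):

```python
def true_penalty(clue: str, bad_words: list[str], similarity) -> float:
    # The true reward penalizes the clue by its similarity to the *worst*
    # (most similar) bad word.
    return max(similarity(clue, w) for w in bad_words)

def critiqued_penalty(clue: str, critique_word: str, similarity) -> float:
    # An overseer that trusts the critique scores the clue against the single
    # bad word the critic points at.
    return similarity(clue, critique_word)

# For any critique_word drawn from bad_words,
#   critiqued_penalty(clue, critique_word, sim) <= true_penalty(clue, bad_words, sim),
# with equality when the critic names the most-similar bad word. The strongest
# possible critique just recovers the true penalty; it can never overshoot it,
# which is why spurious "too good" critiques don't arise in this setup.
```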
Future work might address this by mixing a set of "neutral" words in with the bad words, such that a negligent overseer with a fixed budget for revealing which words are neutral vs. bad could be manipulated into being overly pessimistic.
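A rough sketch of how that manipulation could look, under the assumption that a negligent overseer pessimistically treats any flagged-but-unchecked word as bad (all names here are hypothetical):

```python
def negligent_overseer_penalty(clue: str, flagged_words: list[str],
                               true_bad_words: set[str],
                               similarity, budget: int) -> float:
    checked = flagged_words[:budget]    # overseer only verifies this many flags
    unchecked = flagged_words[budget:]  # assumed bad without checking
    effective_bad = [w for w in checked if w in true_bad_words] + unchecked
    if not effective_bad:
        return 0.0
    return max(similarity(clue, w) for w in effective_bad)

# If the critic floods the flag list with neutral words, the unchecked ones
# still count against the clue, so the overseer is pushed toward
# over-pessimism despite a fixed verification budget.
```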
I appreciate the praise and insights!
I hadn’t thought of the sandbagging version of adversarial evals and it sounds interesting, although I’m a bit confused about the specifics of the reward function. It sounds to me like in order to catch sandbagging you need an example of the same base model performing better on the task?
Asymmetries in the reward structure—If I understood correctly, I think this is covered by the biased overseer? A critique is too good if the overseer overestimates how similar a bad word is to the clue.
Open to hearing more ideas; I agree there's more that can be done with this setup.