This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and for the further grants that supported its development into a published work with new theory and experiments. @Adrià Garriga-alonso was also a great help in writing the paper and interfacing with the review process.
Cassidy Laidlaw published a great paper at ICLR 2025 proving (their Theorem 5.1) that the gap between proxy reward and true reward is bounded given a minimum proxy-true correlation and a maximum chi-squared divergence from the reference policy. Basically, chi-squared divergence works where KL divergence doesn't.
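To give a sense of why a bound of this shape exists (this is a standard chi-squared change-of-measure sketch, not the paper's exact Theorem 5.1; the symbols and normalizations are my own assumptions, so check the paper for the precise statement):

```latex
% Illustrative sketch only -- not the exact statement of Laidlaw et al., Theorem 5.1.
% \mu_0, \mu : occupancy measures of the reference policy and the new policy.
% R, \tilde{R} : true and proxy rewards, normalized to unit variance under \mu_0,
%                with correlation \rho under \mu_0.

% Chi-squared change-of-measure inequality (Cauchy--Schwarz):
\[
\bigl| \mathbb{E}_{\mu}[f] - \mathbb{E}_{\mu_0}[f] \bigr|
  \;\le\; \sqrt{\chi^2(\mu \,\|\, \mu_0)\,\operatorname{Var}_{\mu_0}[f]} .
\]

% Apply it to f = R - \tilde{R}, whose variance under \mu_0 is 2(1-\rho):
\[
\bigl| \mathbb{E}_{\mu}[R - \tilde{R}] - \mathbb{E}_{\mu_0}[R - \tilde{R}] \bigr|
  \;\le\; \sqrt{2(1-\rho)\,\chi^2(\mu \,\|\, \mu_0)} .
\]
```

So high proxy-true correlation under the reference policy plus a small chi-squared divergence keeps the new policy's true return close to its proxy return. The analogous step fails for KL, because a KL bound doesn't control the second moment of the likelihood ratio between the two occupancy measures.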
Using this in practice for alignment is still pretty restrictive: because the new policy can't be exponentially more likely than the reference policy to reach any given state, this will probably only be useful in cases where the reference policy is already intelligent/capable.
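To make "exponentially more likely" concrete, here is a toy calculation (my own illustration, not from the paper): suppose the reference policy reaches some state with probability \(\delta\) and the new policy reaches it with probability close to 1.

```latex
% Toy comparison (my own numbers, not from the paper): reference policy reaches a
% state with probability \delta; the new policy reaches it with probability ~1.
\[
\mathrm{KL}(\mu \,\|\, \mu_0) \approx \log\tfrac{1}{\delta},
\qquad
\chi^2(\mu \,\|\, \mu_0) \approx \tfrac{1}{\delta}.
\]
% For \delta = 10^{-6}: the KL divergence is only about 14 nats,
% while the chi-squared divergence is about 10^6.
```

A modest KL budget still lets the new policy concentrate on states the reference policy essentially never visits, whereas any reasonable chi-squared budget rules that out, which is exactly why the guarantee only bites when the reference policy already visits the relevant states with non-negligible probability.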