Cassidy Laidlaw published a great paper at ICLR 2025 that proved (their Theorem 5.1) that the gap between proxy reward and true reward is bounded given a minimum correlation between the proxy and true rewards under the reference policy and a maximum chi-squared divergence between the optimized policy and the reference policy. Basically, chi-squared divergence works where KL divergence doesn't.
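The mechanism (my sketch of the standard change-of-measure argument, not the paper's exact statement) is that for occupancy measures $d_\pi$ and $d_{\pi_\text{ref}}$ and any state function $f$,
$$\bigl|\mathbb{E}_{d_\pi}[f] - \mathbb{E}_{d_{\pi_\text{ref}}}[f]\bigr| \le \sqrt{\chi^2\!\left(d_\pi \,\middle\|\, d_{\pi_\text{ref}}\right)}\cdot\sqrt{\operatorname{Var}_{d_{\pi_\text{ref}}}(f)}.$$
Taking $f$ to be the difference between true and proxy reward: if the two are highly correlated under the reference policy's state distribution, that difference has small variance there, so a cap on the chi-squared divergence keeps the optimized policy from opening up a much larger gap than the reference policy had.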
Using this in practice for alignment is still pretty restrictive: the fact that the new policy can't be exponentially more likely than the reference policy to reach any given state means this will probably only be useful in cases where the reference policy is already intelligent/capable.
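To illustrate why a KL constraint is the weaker one here, a toy numeric sketch (the two-state setup and numbers are mine, not from the paper): a tuned policy that reaches a rare state 500x more often than the reference pays only a few nats of KL, while the chi-squared divergence blows up.

```python
import numpy as np

# Toy two-state occupancy distributions (illustrative only, not from the paper).
# The second state is a rare "reward-hacking" state under the reference policy.
ref = np.array([0.999, 0.001])                        # reference occupancy
boost = 500                                           # tuned policy reaches the rare state 500x more often
tuned = np.array([1 - boost * 0.001, boost * 0.001])  # -> [0.5, 0.5]

kl = np.sum(tuned * np.log(tuned / ref))              # KL(tuned || ref)
chi2 = np.sum(ref * (tuned / ref - 1) ** 2)           # chi^2(tuned || ref)

print(f"KL   = {kl:.2f}")    # ~2.8 nats: looks like a modest penalty
print(f"chi2 = {chi2:.0f}")  # ~249: the squared-ratio penalty explodes
```

Per unit of probability the tuned policy puts on that state, the KL cost grows like the log of the occupancy ratio, while the chi-squared cost grows like the ratio itself, which is what makes the bound bite and also why it forbids the policy from ever straying far from states the reference policy can already reach.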