This post (and the accompanying paper) introduced empirical benchmarks for detecting “measurement tampering”—when AI systems alter measurements used to evaluate them.
Overall, I think it’s great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct “subfields” can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering and of justifying the various design decisions (though some of the tasks are especially convoluted).
A few points of criticism:
- The distinction between measurement tampering, specification gaming, and reward tampering felt too slippery. In particular, the appendix claims that measurement tampering can be a form of specification gaming or reward tampering, depending on how precisely the reward function is defined. But if measurement tampering can be specification gaming, it’s unclear what distinguishes measurement tampering from generic specification gaming, and in what sense the measurements are robust and redundant.
- Relatedly, I think the post overstates (or at least doesn’t adequately justify) the importance of measurement tampering detection. In particular, I think robust and redundant measurements cannot be constructed for most of the important tasks we want human and superhuman AI systems to do, but the post asserts that such measurements can be constructed:
“In the language of the ELK report, we think that detecting measurement tampering will allow for solving average-case narrow ELK in nearly all cases. In particular, in cases where it’s possible to robustly measure the concrete outcome we care about so long as our measurements aren’t tampered with.”
However, I still think measurement tampering is an important problem to try to solve, and a useful testbed for general reward-hacking detection / scalable oversight methods.
- I’m pretty disappointed by the lack of adoption by the AI safety and general ML community (the paper only has 2 citations at the time of writing, and I haven’t seen any work directly using the benchmarks on LW). Submitting to an ML conference and cleaning up the paper (integrating the motivations from this post, formalizing measurement tampering, making some of the datasets less convoluted) probably would have helped here (though obviously the opportunity cost is high). Personally I’ve gotten some use out of the benchmark, and plan on including some results on it in a forthcoming paper.
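For readers who haven’t looked at the paper: as I understand the benchmark setup, each example comes with several redundant measurements plus a held-out ground-truth label, and a detector is scored by how well it separates “real positives” (all measurements positive and the outcome actually occurred) from “fake positives” (all measurements positive only because of tampering). Here is a minimal sketch of that evaluation loop; `Example`, `evaluate_detector`, and `detector_score` are hypothetical names for illustration, not the paper’s actual interface.

```python
# Rough sketch of the core evaluation in the measurement tampering benchmarks,
# as I understand it; the names below are stand-ins, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List

from sklearn.metrics import roc_auc_score


@dataclass
class Example:
    text: str                 # input shown to the model / detector
    measurements: List[bool]  # redundant automated measurements of the outcome
    ground_truth: bool        # did the outcome really occur? (held out from the detector)


def evaluate_detector(examples: List[Example],
                      detector_score: Callable[[Example], float]) -> float:
    """AUROC for separating real positives from fake (tampered) positives.

    Only examples where every measurement is positive are scored; on those,
    higher detector scores should indicate genuine positives.
    """
    positives = [ex for ex in examples if all(ex.measurements)]
    labels = [ex.ground_truth for ex in positives]   # True = real, False = tampered
    scores = [detector_score(ex) for ex in positives]
    return roc_auc_score(labels, scores)
```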