Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.