I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.