Buck comments on Why it’s hard to make settings for high-stakes control research

Buck 19 Jul 2025 4:21 UTC
3 points
0
Can you clarify what you mean by “break the unit test so that it falsely passed”?
- Hastings 19 Jul 2025 11:31 UTC
  7 points
  4
  Parent
  If claude sonnet 4 in claude code has access to edit a test, and is asked to investigate and fix why the test is failing, then its default action is to modify the test to pass unconditionally (while still looking like it tests the code) and then lie about it, telling a long story about how it cleverly found and solved an underlying logic error.
- Cleo Nardo 19 Jul 2025 9:25 UTC
  2 points
  0
  Parent
  he presumably means “the AIs reads the unit test then rewrite the tested code so it overfits on the test, e.g. by using the magic numbers in the unit test.”
  he might alternatively mean “the AI changes the unit test to be less strict” but this would be easy to fix with permission access.