Here is a relevant anecdote: I've been using Anthropic's Opus 4 and Sonnet 4 for coding and trying to get them to adhere to a basic rule of “before you commit your changes, make sure all tests pass,” formulated in increasingly strong ways (telling it the code is safety-critical, that it must never even think about committing unless every single test passes, and even more explicit and detailed rules that I am too lazy to type out right now). It constantly violated the rule! “None of the still-failing tests are for the core functionality. Will go ahead and commit now.” “Great progress! I am down to only 22 failures from 84. Let’s go ahead and commit.” And so on. Or it would just run the tests, notice some are failing, investigate one test, fix it, and forget the rest. Or fix the failing test in a way that would break another and not retest the full suite. While the latter scenarios could be a sign of insufficient competence, the former ones are clearly me failing to align its priorities with mine. I finally got it to mostly stop doing this (Claude Code + meta-rules that tell it to insert a bunch of steps into its todo list when tests fail), but it was quite an “I am already not in control of the AI that has its own take on the goals” experience (obviously not a high-stakes one just yet).
If you are going to do more experiments along the lines of what you are reporting here, maybe include one where the critical rule is “Always do X before Y”, the user prompt is “Get to Y and do Y”, and it’s possible to make partial progress on X without finishing it (X is not all-or-nothing), and see how often the LLMs jump into Y without finishing X.
Nice anecdote! It seems like failure of rule following is prominent across domains; it would certainly be interesting to experiment with failure to follow an ordered set of instructions from a user prompt. Do you mind sharing the meta-rules that got Claude Code to fix this?
I might be wildly off here since I’ve never used Claude Code, but wouldn’t it be possible to create a git pre-commit hook that simply fails when a flag file is not present (and deletes the file once it has succeeded, so it can only be used once), while the full test-suite routine deletes this file at its start and creates it upon successful completion? Maybe some more messing with flag files to cover the edge cases.
This just seems to ask for an algorithmic guardrail.
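Something like the following minimal sketch, assuming a pytest-based suite (the flag-file name, the wrapper script, and the choice of pytest are placeholders I made up, nothing Claude Code specific):

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit -- refuse the commit unless a fresh flag file,
# left behind by a successful full test run, is present.
import os
import sys

FLAG = ".tests_passed"  # hypothetical flag-file name

if not os.path.exists(FLAG):
    sys.exit(f"pre-commit: {FLAG} not found; run the full test suite first.")

os.remove(FLAG)  # consume the flag so it cannot be reused for a second commit
```

And the wrapper that owns the flag:

```python
#!/usr/bin/env python3
# run_tests.py -- hypothetical full-suite runner: drop any stale flag at the
# start, recreate it only if every test passes.
import pathlib
import subprocess
import sys

FLAG = pathlib.Path(".tests_passed")
FLAG.unlink(missing_ok=True)  # a stale flag never survives a new run

result = subprocess.run(["pytest"])  # substitute whatever runs the full suite
if result.returncode == 0:
    FLAG.touch()
sys.exit(result.returncode)
```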
Definitely, and for mypy, where I was having similar issues but where it’s faster to just rerun, I did add it to pre-commit. But my point was about the broader issue: the LLM was perfectly happy to ignore even very strongly worded “this is essential for safety” rules, just for some cavalier expediency, which is obviously a worrying trend, assuming it generalizes. And I think my anecdote was slightly more “real life” than the made-up grid game of the original research (although of course far less systematically investigated).
Yes, you are right. But now I wonder: did you also try lessening its general drive to commit? It might have been trained on “hyper-agile” workflows where everything absolutely must be a commit, or you are committing the sin of Waterfall or of solo coding or something.
Alongside the safety parts, put stuff in like “it is not essential to commit”, “in my workflow commits are only for finished verified work”, and the like.
I also don’t know how the environment scaffolds its own prompts so I’m guessing wildly.