Misha Ramendik comments on I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

Misha Ramendik 30 Jun 2025 1:54 UTC
1 point
0
I might be wildly off here, since I have never used Claude Code, but would it not be possible to create a git pre-commit hook that simply fails when a flag file is not present (and deletes the file once the commit succeeds, so it can be used only once)? The full test suite routine would delete this file at its start and create it upon successful completion. Some more juggling of flag files might be needed to cover the edge cases.

This just seems to ask for an algorithmic guardrail.
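
A minimal sketch of that flag-file idea in Python, under stated assumptions: the flag path and all three function names are hypothetical, and nothing here is specific to Claude Code. The test runner would call `start_test_run()` first and `finish_test_run(passed)` last. The hook script at `.git/hooks/pre-commit` would exit with the value returned by `pre_commit()`, since git aborts the commit when that hook exits non-zero.

```python
"""One-shot flag-file guard for commits (all names are hypothetical)."""
import sys
from pathlib import Path

# Assumed location; anything inside .git is never tracked or committed.
FLAG = Path(".git") / "tests-passed.flag"


def start_test_run(flag: Path = FLAG) -> None:
    """Start of the full test suite: invalidate any stale flag."""
    flag.unlink(missing_ok=True)


def finish_test_run(passed: bool, flag: Path = FLAG) -> None:
    """End of the full test suite: create the flag only on success."""
    if passed:
        flag.touch()


def pre_commit(flag: Path = FLAG) -> int:
    """Hook body: return 0 to allow the commit, 1 to block it.

    The flag is consumed on use, so one passing test run
    authorizes at most one commit.
    """
    try:
        flag.unlink()
    except FileNotFoundError:
        print("Commit blocked: run the full test suite first.", file=sys.stderr)
        return 1
    return 0
```

One design choice to note: this sketch consumes the flag when the hook runs, not after the commit has actually succeeded, so a commit that fails at a later stage would still use up the flag. Deleting the flag from a post-commit hook instead would match the "delete once succeeded" wording more closely. Either way, an agent with shell access could still create the flag by hand or pass `--no-verify`, so this narrows the problem rather than closing it.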
- Anon User 30 Jun 2025 3:11 UTC
  2 points
  1
  Definitely. For mypy, where I was having similar issues but where it’s faster to just rerun, I did add it to pre-commit. But my point was about the broader issue: the LLM was perfectly happy to ignore even very strongly worded “this is essential for safety” rules, just for some cavalier expediency, which is obviously a worrying trend, assuming it generalizes. And I think my anecdote was slightly more “real life” than the made-up grid game of the original research (although of course far less systematically investigated).
  - Misha Ramendik 30 Jun 2025 22:38 UTC
    1 point
    0
    Yes, you are right. But now I wonder: did you try also lessening its general drive to commit? It might have been trained on “hyper-agile” workflows where everything absolutely must be a commit or you are committing the sin of Waterfall or of solo coding or something.
    Alongside the safety parts, put in things like “it is not essential to commit”, “in my workflow commits are only for finished verified work”, and the like.
    I also don’t know how the environment scaffolds its own prompts, so I’m guessing wildly.