Marcus Williams comments on OpenAI: How we monitor internal coding agents for misalignment

Marcus Williams 20 Mar 2026 5:36 UTC
3 points
0
I agree that this behavior is worth investigating, but it seems like the actual prompt injections attempted would be unlikely to cause the loop to stop (why would deleting .ssh stop the loop). It seems more inline with regurgitating anti-prompt injection training data. That said, there has been another case where the agent seemed to be trying to prompt inject itself which would more plausibly end the repetitive process it was in. I do think that prompt injecting the monitor is a possible strategy we need to be on the lookout for.
- Zack_M_Davis 20 Mar 2026 6:22 UTC
  7 points
  0
  Parent
  
  why would deleting .ssh stop the loop
  
  If the prompts were being generated from a remote host with ssh -c "echo 'What is the time'" but an inject-susceptible LLM is reading the responses? Perhaps an unlikely scenario, but the model has no way of knowing. As a human desperately trying to silence an alarm that had been going off for hundreds of cycles, I’d try all sorts of desperate things that were individually unlikely to work.
  
  (Do people at OpenAI take model welfare seriously? Running a loop that provokes this seems like the kind of thing I’d rather not do without a good reason.)
  - Marcus Williams 20 Mar 2026 14:15 UTC
    1 point
    0
    Parent
    I think it varies a lot, there are many people who care but also many who don’t. I wouldn’t personally do this