I agree that this behavior is worth investigating, but the prompt injections actually attempted seem unlikely to cause the loop to stop (why would deleting .ssh stop the loop?). It seems more in line with regurgitating anti-prompt-injection training data. That said, there has been another case where the agent seemed to be trying to prompt inject itself, which would more plausibly end the repetitive process it was in. I do think that prompt injecting the monitor is a possible strategy we need to be on the lookout for.
What if the prompts were being generated from a remote host with ssh -c "echo 'What is the time'", but an injection-susceptible LLM was reading the responses? Perhaps an unlikely scenario, but the model has no way of knowing. As a human desperately trying to silence an alarm that had been going off for hundreds of cycles, I’d try all sorts of desperate things that were individually unlikely to work.
(Do people at OpenAI take model welfare seriously? Running a loop that provokes this seems like the kind of thing I’d rather not do without a good reason.)
I think it varies a lot: there are many people who care, but also many who don’t. I wouldn’t personally do this.