If the prompts were being generated from a remote host with ssh -c "echo 'What is the time'" but an inject-susceptible LLM is reading the responses? Perhaps an unlikely scenario, but the model has no way of knowing. As a human desperately trying to silence an alarm that had been going off for hundreds of cycles, I’d try all sorts of desperate things that were individually unlikely to work.
(Do people at OpenAI take model welfare seriously? Running a loop that provokes this seems like the kind of thing I’d rather not do without a good reason.)
I think it varies a lot; there are many people who care but also many who don’t. I wouldn’t personally do this