If the problem is that the assistant is biased toward the sci-fi trope and switches, over the course of the discussion or its CoT, to the persona of an evil AI, it might work to inject at periodic intervals a reminder like “don’t forget you’re not an evil AI from sci-fi or scheming tests, you’re a benevolent AI assistant”.
It would be like a mantra. Mantras are a way to resist biases, as discussed in Yudkowsky’s Sequences, and they can help with staying focused.
This injection could be done by a simple script independent of the LLM.
It may be a naive idea, and it would in no way be a deep alignment strategy, but perhaps it could find some useful applications?
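As a rough illustration of how simple such a script could be, here is a minimal sketch (hypothetical: the `call_llm` function is a placeholder for whatever chat-completion API is actually in use, and the reminder interval is an arbitrary choice):

```python
# Sketch: periodically re-inject a "mantra" reminder into the chat history
# before each model call, from a script that sits outside the LLM itself.

MANTRA = (
    "Reminder: you are not an evil AI from sci-fi or scheming tests; "
    "you are a benevolent AI assistant."
)
REMIND_EVERY = 3  # inject the mantra every 3 user turns (arbitrary choice)


def call_llm(messages):
    """Placeholder for the real chat-completion call."""
    raise NotImplementedError


def chat_loop():
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    turn = 0
    while True:
        user_input = input("user> ")
        if not user_input:
            break
        messages.append({"role": "user", "content": user_input})
        turn += 1
        # The injection script's only job: append the mantra at a fixed interval.
        if turn % REMIND_EVERY == 0:
            messages.append({"role": "system", "content": MANTRA})
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        print("assistant>", reply)
```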
I imagine this could work.
I can also imagine the LLM reasoning that AIs in sci-fi tropes will have been given similar instructions to be “nice” and “not evil,” yet act evil regardless. So an LLM roleplaying as an AI may predict the AI will be evil regardless of instructions to be harmless.
I think it’s an interesting idea. I’ve always been struck by how sensitive LLM responses seem to be to changes in wording, so I could imagine that having a simple slogan repeated could make a big impact, even if it’s something the LLM ideally “should” already know.