This isn’t a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
If you wanted to have an unaligned LLM that doesn’t abuse humans, couldn’t you just never sample from it after training it to be unaligned?