How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
This isn’t a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
If you wanted to have an unaligned LLM that doesn’t abuse humans, couldn’t you just never sample from it after training it to be unaligned?