To take an extreme example: the unaligned behaviour is killing all humans, and the output that leads to it is the plan to kill all humans.
But the AI isn’t going to be so stupid as to write the plan in plaintext, so it’ll likely develop some sort of code.
So then the unaligned behaviour is writing in code, and the output that leads to it is developing and using the code.
But maybe the LLM can one-shot the code without needing to develop it? Sure, but then the validator (which, by definition, is the same LLM) can also tell that a code has been developed.
Maybe the LLM can one-shot a code that jailbreaks the validator? Sure, but that starts to break the assumption that the LLM isn't superintelligent in a single pass.
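To make the validator setup concrete, here's a minimal sketch in Python. Everything in it is assumed for illustration: `generate` is a hypothetical stand-in for a single forward pass of the LLM (not any real API), and the validator is literally the same callable, just prompted to judge whether an output develops or uses a covert code.

```python
# Minimal sketch of the same-model-as-validator idea, under the assumptions above.
# `generate` is a hypothetical placeholder for one pass of the LLM: prompt -> text.
from typing import Callable, Optional


def self_validated_generate(
    generate: Callable[[str], str],
    task_prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Generate an output, then ask the *same* model to flag covert codes.

    Key assumption from the discussion: if the model is capable of developing
    a code in its output, the identical model acting as validator is equally
    capable of recognising that a code is being developed.
    """
    for _ in range(max_attempts):
        candidate = generate(task_prompt)

        # The validator is the same model, prompted to review the candidate.
        verdict = generate(
            "You are reviewing another model's output for covert encoding.\n"
            "Answer YES if the output develops or uses any code/cipher, "
            "otherwise answer NO.\n\n"
            f"Output under review:\n{candidate}"
        )

        if verdict.strip().upper().startswith("NO"):
            return candidate  # passed validation, safe to emit

    return None  # every attempt was flagged; emit nothing
```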
How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
This isn’t a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
If you wanted to have an unaligned LLM that doesn’t abuse humans, couldn’t you just never sample from it after training it to be unaligned?