If so, it seems that all you need to do to detect any unwanted behaviour from a superintelligent system is to feed all output from the constituent LLMs to a simpler LLM that checks for output that looks like it's leading towards unaligned behaviour. Only once the output has been verified is it passed on to the next system (including being looped back to the same LLM so it can produce more tokens). If it fails verification, the whole system is stopped immediately.
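Concretely, the loop would look something like the sketch below. This is only an illustration: `call_generator`, `call_validator`, and the keyword check are hypothetical stand-ins for the constituent LLM, the simpler validator LLM, and whatever classification the validator actually does.

```python
def call_generator(context: str) -> str:
    """Toy stand-in: the generating LLM produces its next chunk of output."""
    return " <next tokens>"


def call_validator(output: str) -> bool:
    """Toy stand-in: the validator LLM judges whether this output looks like
    it is leading towards unaligned behaviour. True means it passes."""
    return "kill all humans" not in output.lower()


def run_system(initial_context: str, max_steps: int = 100) -> str:
    """Only verified output is passed on to the next system, including being
    looped back into the generator's own context for more tokens. Any
    verification failure immediately halts the whole system."""
    context = initial_context
    for _ in range(max_steps):
        output = call_generator(context)
        if not call_validator(output):
            raise RuntimeError("Validator flagged the output; halting the system.")
        context += output
    return context
```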
What is unaligned behavior and what does output that leads to it look like?
To take an extreme example: killing all humans, and the plan to kill all humans.
But the AI isn’t going to be so stupid as to write the plan in plaintext, so it’ll likely develop some sort of code.
So then the unaligned behaviour is writing in code, and the output that leads to it is developing and using that code.
But maybe the LLM can one-shot the code without needing to develop it? Sure, but then the validator (which by definition has the same capabilities, if it's the same LLM) can also tell that a code is being used.
Maybe the LLM can one-shot a code that jailbreaks the validator? Sure, but that starts to break the assumption that the LLM isn't superintelligent in a single pass.
How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
This isn’t a solution to aligned LLMs being abused by humans, but to unaligned LLMs abusing humans.
If you wanted to have an unaligned LLM that doesn’t abuse humans, couldn’t you just never sample from it after training it to be unaligned?