Some random ideas on how to make GPT-based AI safer.
1) Scaffolding: use a rule-based AI to check every solution provided by the GPT part. This could work for computation, self-driving, or robotics, but not against elaborate adversarial plots.
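A minimal sketch of the scaffolding idea: a deterministic, rule-based verifier checks each answer the generative model proposes before it is accepted. Here `gpt_solve` is a hypothetical stand-in for a real model call, and the checks cover only toy arithmetic.

```python
def gpt_solve(problem):
    # Hypothetical stand-in for a GPT call; one answer is deliberately wrong.
    return {"2+2": "4", "sqrt(16)": "5"}.get(problem, "unknown")

def rule_check(problem, answer):
    """Deterministic, rule-based verification for simple arithmetic tasks."""
    checks = {
        "2+2": lambda a: a == str(2 + 2),
        "sqrt(16)": lambda a: a == str(int(16 ** 0.5)),
    }
    check = checks.get(problem)
    return check(answer) if check else False

def safe_solve(problem):
    # Only release answers that pass the rule-based check.
    answer = gpt_solve(problem)
    if not rule_check(problem, answer):
        raise ValueError(f"Rejected unverified answer: {answer!r}")
    return answer

print(safe_solve("2+2"))  # the checker accepts the correct answer
```

As the text notes, this only helps where answers are mechanically checkable; an adversarial model could still produce outputs the rules cannot evaluate.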
2) Many instances. Run GPT several times and choose a random or the best answer—we already do this. Run several instances of GPT with different parameters or different training data and compare answers. Run different prompts. The median output seems to be a Schelling point around the truth, and outlier answers are more likely to be wrong or malicious.
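The many-instances idea can be sketched as a simple majority vote: sample several answers and treat the modal one as the consensus, flagging outliers as suspect. `sample_answer` is a hypothetical stand-in for one model run with a given seed or parameter setting.

```python
from collections import Counter

def sample_answer(seed):
    # Hypothetical: different seeds/parameters yield different outputs.
    answers = ["Paris", "Paris", "Paris", "Lyon", "Paris"]
    return answers[seed % len(answers)]

def consensus_answer(n_samples=5):
    # Tally the votes and return the majority answer with its share.
    votes = Counter(sample_answer(i) for i in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

print(consensus_answer())  # ('Paris', 0.8)
```

For numeric answers, taking the median instead of the mode gives the same outlier-rejection effect.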
3) Use intrinsic GPT properties to prevent malicious behaviour. For example, higher temperature increases the randomness of the output and disrupts any internal mesa-optimisers. Shorter prompts and the lack of long-term memory also prevent complex plotting.
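The temperature point can be illustrated without any model: temperature rescales the logits before the softmax, so a higher temperature flattens the token distribution, making any single deterministic "plan" less likely to survive sampling. This is a pure-math sketch with made-up logit values.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically
    # stable softmax (subtracting the max before exponentiating).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
low = softmax_with_temperature(logits, 0.5)   # sharp: top token dominates
high = softmax_with_temperature(logits, 5.0)  # flat: closer to uniform
print(max(low) > max(high))  # True
```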
4) Train and test on a database of ethical texts.
5) Use prompts which include a notion of safety, like “A benevolent AI would say...”, or counterfactuals which prevent complex planning in the real world (e.g. an AI on the Moon).
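A minimal sketch of safety-framed prompting: wrap every user query in a frame that invokes benevolence or a counterfactual setting before it reaches the model. The frame texts below are illustrative examples, not tested defences.

```python
# Illustrative safety frames; a real system would tune these empirically.
SAFETY_FRAMES = [
    "A benevolent AI would say: {query}",
    "An AI located on the Moon, with no actuators on Earth, answers: {query}",
]

def frame_prompt(query, frame_index=0):
    # Embed the user's query inside the chosen safety frame.
    return SAFETY_FRAMES[frame_index].format(query=query)

print(frame_prompt("How do I secure my server?"))
```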
6) Black-box the internal parts of the system, such as the neural network code.
7) Run it millions of times in test environments or on test tasks.
8) Use another GPT instance to produce a “safety TL;DR” of any output, or a prediction of the possible bad outcomes that could follow from a given output.
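A sketch of the safety-TL;DR pipeline: a second reviewer checks the first model's output and flags possible harms before release. Here `safety_review` is a hypothetical stand-in for a second model call; a keyword blocklist stands in for the reviewer's judgment.

```python
def safety_review(text):
    # Hypothetical reviewer: a real system would prompt another model,
    # e.g. "List possible bad outcomes of the following output: ...".
    # A keyword blocklist stands in for that judgment here.
    red_flags = ["rm -rf", "disable the firewall"]
    hits = [flag for flag in red_flags if flag in text]
    return {"summary": text[:60], "flags": hits, "release": not hits}

report = safety_review("To clean up, run rm -rf / on the host.")
print(report["release"])  # False
```

A real reviewer model would of course share the weaknesses of the model it reviews, which is why the text frames this as one layer among several.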
Disclaimer: a safer AI is not provably safe. It is merely orders of magnitude safer than an unsafe one, and it will eventually fail.