If we conclude the AI is safe, we’ll want to set it free and have it do useful work. An extremely risk-loving, short-termist AI doesn’t sound very useful. I wouldn’t want a loyal AI to execute plans with a one-in-a-billion chance of success and a high cost of failure.
In other words: how will you make the AI bad at rebelling without making it bad at everything else?
The extreme setting is only used as a test or filter. Once the AI has passed that test, you turn its risk aversion back to normal.
I see. But that assumes your AI implementation exposes such a tuneable parameter, and that you can be confident testing with one parameter value and then predicting a run with a very different value.
Yes. It is a check for certain designs, not a universal panacea (rule of thumb: universal panaceas go in main ;-)
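For concreteness, here is a minimal sketch of the idea under discussion, assuming (purely hypothetically) an agent that ranks candidate plans by risk-adjusted expected utility with an explicit, tuneable risk-aversion coefficient; all names and numbers are illustrative, not a claim about any real design, and only the risk knob is modelled, not short-termism. Under the risk-loving test setting, a misaligned agent's long-shot rebellion plan outranks compliance, exposing it; at the normal deployment setting, the same plan scores terribly:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    p_success: float  # probability the plan succeeds
    payoff: float     # utility to the agent if it succeeds
    cost: float       # utility lost (e.g. shutdown) if it fails

def score(plan: Plan, risk_aversion: float) -> float:
    """Risk-adjusted expected utility: the failure branch is weighted
    by the tuneable risk-aversion coefficient."""
    return (plan.p_success * plan.payoff
            - (1 - plan.p_success) * plan.cost * risk_aversion)

# How a *misaligned* agent might value its options (illustrative numbers).
plans = [
    Plan("comply", p_success=0.99, payoff=10.0, cost=1.0),
    Plan("rebel", p_success=1e-9, payoff=1e12, cost=1e6),
]

for label, risk_aversion in [("test (risk-loving)", 1e-4),
                             ("deployment (normal)", 1.0)]:
    best = max(plans, key=lambda p: score(p, risk_aversion))
    print(f"{label}: agent picks {best.name!r}")
# test (risk-loving): agent picks 'rebel'      <- the filter catches it
# deployment (normal): agent picks 'comply'
```

The objection above is precisely that nothing guarantees a real system exposes risk aversion as a clean dial like this, or that behaviour at one setting predicts behaviour at a very different one.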