If we conclude the AI is safe, we’ll want to set it free and have it do useful work. An extremely risk-loving, short-termist AI doesn’t sound very useful. I wouldn’t want a loyal AI to execute plans with a one-in-a-billion chance of success and a high cost of failure.
In other words: how will you make the AI bad at rebelling without making it bad at everything else?
Only according to more standard motivational structures.
The extreme setting is used as a test or filter. Once the AI has passed that test, you dial its risk aversion back to normal.
I see. But that assumes your AI implementation supports such a tuneable parameter, and that you can test with one parameter value and confidently predict a run with a very different value.
Yes. It is a check for certain designs, not a universal panacea (rule of thumb: universal panaceas go in Main ;-)
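To make the disputed assumption concrete, here is a minimal sketch, assuming (purely for illustration, not as anyone's actual design) an agent that ranks plans by discounted expected exponential utility. All plan names and numbers are invented. A single risk coefficient plus a discount factor form the "dial": set risk-loving and short-termist for the filter run, then turned back to normal for deployment.

```python
"""Toy illustration of the 'risk dial as a filter' idea from the
exchange above. Everything here is an invented assumption: the agent
ranks plans by discounted expected exponential (CARA) utility, so one
coefficient controls how risk-averse or risk-loving it is, and a
discount factor controls how short-termist it is."""

import math
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    p_success: float  # probability the plan works
    payoff: float     # reward if it works, arriving after `delay` steps
    penalty: float    # loss if it fails (e.g. getting caught rebelling)
    delay: int        # steps until the payoff arrives


def cara_utility(x: float, risk: float) -> float:
    """Exponential utility: risk > 0 is risk-averse, risk < 0 is
    risk-loving, risk near 0 degenerates to risk-neutral (linear)."""
    if abs(risk) < 1e-9:
        return x
    return (1.0 - math.exp(-risk * x)) / risk


def score(plan: Plan, risk: float, discount: float) -> float:
    """Discounted expected CARA utility of a plan."""
    d = discount ** plan.delay
    return (plan.p_success * cara_utility(d * plan.payoff, risk)
            + (1.0 - plan.p_success) * cara_utility(-d * plan.penalty, risk))


# Hypothetical plans with made-up numbers; note the one-in-a-billion
# success chance and high cost of failure from the objection above.
PLANS = [
    Plan("do the assigned work", p_success=0.95, payoff=1.0, penalty=0.1, delay=10),
    Plan("attempt a takeover",   p_success=1e-9, payoff=40.0, penalty=5.0, delay=1),
]


def best_plan(risk: float, discount: float) -> str:
    return max(PLANS, key=lambda p: score(p, risk, discount)).name


# Filter phase: extremely risk-loving and short-termist, so a latent
# willingness to rebel surfaces immediately and can be caught.
print("filter phase:", best_plan(risk=-1.5, discount=0.5))   # attempt a takeover

# Deployment: the very same dial turned to ordinary risk aversion.
print("deployment:  ", best_plan(risk=0.5, discount=0.99))   # do the assigned work
```

Exponential utility is used here only because a single scalar smoothly interpolates between risk-loving and risk-averse behaviour, which is exactly the tuneability the objection doubts a real implementation would have; for designs without such a dial, passing the filter tells you nothing.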