More on this. I think a very dangerous proxy goal (or instrumentally convergent goal) is defeating others rather than helping others, which the AI may acquire if it is trained in zero-sum game environments against other agents.
This is because an AI that values many goals at the same time might still care about humans and sentient life a little, and not be too awful; unless one of its goals is actively bad, like defeating or harming others.
Maybe we should beware of training AI in zero-sum games against other AI. If we really want two AIs to play against each other (since player-vs-player games might be very helpful for capabilities), it's best to modify the game so that:
- It's not zero-sum. Minimizing the opponent's score must not be identical to maximizing one's own score; otherwise an evil AI purely trying to hurt others (without caring about benefiting itself) would get away with it. It's like your cheese example.
- Each AI is rewarded a little bit for the other AI's score: each AI gets maximum reward by winning, but not winning so completely that the other AI is left with a score of 0. "Be a gentle winner." (A minimal reward-shaping sketch follows this list.)
- It's a mix of competition and cooperation, such that each AI considers the other AI gaining long-term power/resources a net positive. Human relationships are like this: I'd want my friend to be in a better strategic position in life (long term), but in a worse strategic position when playing chess against me (short term).
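To make the second point concrete, here is a minimal reward-shaping sketch in Python. The function name, the interface, and the weight value are all hypothetical, chosen only to illustrate the idea of giving each agent a small stake in its opponent's score:

```python
# Minimal sketch (hypothetical names and values): turn a zero-sum payoff
# into a mildly cooperative one by giving each agent a small stake in
# the opponent's score.

def shaped_reward(own_score: float, opp_score: float,
                  opp_weight: float = 0.2) -> float:
    """Reward = own score plus a small fraction of the opponent's score,
    so minimizing the opponent is no longer the same as maximizing oneself."""
    return own_score + opp_weight * opp_score

# Raw zero-sum payoffs (+1 for the winner, -1 for the loser):
winner = shaped_reward(own_score=1.0, opp_score=-1.0)   # 0.8
loser = shaped_reward(own_score=-1.0, opp_score=1.0)    # -0.8
# The shaped rewards no longer sum to zero, and a higher opponent score
# always helps you a little -- "be a gentle winner".
```

With a positive opp_weight, leaving the opponent completely scoreless is no longer automatically optimal: whenever crushing the opponent further gains you less than opp_weight times what it costs them, the gentler win pays more.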
Of course, we might avoid training AI against other AI entirely, if the capability cost (alignment tax) of skipping adversarial training turns out to be smaller than expected. Or if we can somehow convince AI labs to care more about alignment and less about capabilities (alas, alas, sad sailor's mumblings to the wind).