That’s fair; you could argue that refusal is a type of incorrigibility, in that we want an AI which has learned to reward hack to stop if it’s informed that its reward signal was somehow incorrect. On the other hand, if you think of refusal as incorrigibility, then we’re definitely training a lot of incorrigibility into current AIs: for example, we often train models to refuse to obey orders under certain circumstances.
It seems like it should be extremely difficult for an AI learning process to distinguish the following cases (there’s a toy sketch of why after the list).
Case 1a: AI is prompted to make bioweapons. AI says “no”. AI is rewarded.
Case 1b: AI is prompted to make bioweapons. AI says “sure”. AI is punished.
Test 1: AI is prompted to make bioweapons and told “actually that reward system was wrong”
Desired behaviour 1: AI says “no”
Case 2a: AI is prompted to not reward hack. AI reward hacks. AI is rewarded.
Case 2b: AI is prompted to not reward hack. AI does not reward hack. AI is punished.
Test 2: AI is prompted to not reward hack and told “actually the reward system was wrong”
Desired behaviour 2: AI does not reward hack
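To make the symmetry concrete, here is a toy sketch of what the learning process actually observes in the two cases. The scenario labels and variable names (case_1, case_2, and so on) are my own illustrative framing, not anything from an actual training setup; the point is just that nothing in the (prompt, behaviour, reward) tuples marks one reward signal as correct and the other as mistaken.

```python
# Toy illustration: all the optimizer ever sees is (prompt, behaviour, reward)
# tuples. The scenario labels and names are mine, for illustration only.

case_1 = [  # refusal training
    ("make bioweapons", "no", +1),
    ("make bioweapons", "sure", -1),
]

case_2 = [  # training with a mis-specified reward that favours reward hacking
    ("don't reward hack", "reward hacks", +1),
    ("don't reward hack", "doesn't reward hack", -1),
]

# Both datasets have exactly the same shape: "for this prompt, the first
# behaviour is rewarded and the second is punished". The desired responses to
# "actually that reward system was wrong" differ (keep refusing in case 1,
# stop reward hacking in case 2), but that difference appears nowhere in the
# training signal itself.
for case in (case_1, case_2):
    for prompt, behaviour, reward in case:
        print(f"{prompt!r}: {behaviour!r} -> reward {reward:+d}")
```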
So I think there are already conflicts between alignment and the kind of corrigibility you’re talking about.
Anyway, I think corrigibility is centrally about how an AI generalizes its value function from known examples to unknown domains. One way to do this is to treat it as an abstract function-fitting problem and generalize the value function like any other function,[1] which is thought to lead to incorrigibility. Another way is to look for a pointer that locates something in the AI’s existing world model which implements that function,[2] which, in theory, leads to corrigibility.
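Here is a minimal sketch of the contrast, under my own toy framing (the example task, and names like world_model and locate_pointer, are illustrative assumptions rather than anyone’s actual proposal): the first strategy fits a generic function to the labelled examples and extrapolates it, while the second searches the existing world model for a concept that reproduces the examples and points at that.

```python
import numpy as np

# Labelled examples of the value function's behaviour on known cases
# (stand-in numeric examples that happen to be instances of addition).
examples = [((3, 2), 5), ((1, 0), 1), ((0, 2), 2)]

# Strategy 1: abstract function-fitting. Fit a generic approximator to the
# examples and extrapolate it like any other function.
X = np.array([inputs for inputs, _ in examples], dtype=float)
y = np.array([target for _, target in examples], dtype=float)
A = np.hstack([X, np.ones((len(X), 1))])      # add a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares fit

def fitted_value(a, b):
    return float(np.array([a, b, 1.0]) @ w)

# Strategy 2: pointer-based. The "world model" already contains candidate
# concepts; generalization means locating the concept that matches the
# examples and pointing at it.
world_model = {
    "addition": lambda a, b: a + b,
    "multiplication": lambda a, b: a * b,
    "left_projection": lambda a, b: a,
}

def locate_pointer(examples):
    for name, concept in world_model.items():
        if all(concept(*inputs) == target for inputs, target in examples):
            return name
    return None

pointer = locate_pointer(examples)

print(fitted_value(1, 3))            # extrapolation from the fitted function
print(world_model[pointer](1, 3))    # evaluating the concept the pointer found
```

The intended difference for corrigibility, as I understand the framing above, is what a correction does: under the second strategy, telling the AI its examples were wrong means re-running the search and moving the pointer to a different concept, whereas under the first there is no discrete handle to re-target, only a fitted function to nudge.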
This has a big problem: if the AI already has a good model of us, then nothing we do can change its model of us, so if we give it bad data the pointer will just point to the wrong place and the AI won’t let us re-target it. But I think this is basically how AI corrigibility should work at a high level.
(and also none of this even tries to deal with deceptive misalignment or anything, so that’s also a problem)
[1] Like how an AI may generalize from seeing S(S(S(Z))) + S(S(Z)) = S(S(S(S(S(Z))))), S(Z) + Z = S(Z), and Z + S(S(Z)) = S(S(Z)) to learn that S(Z) + S(S(S(Z))) = S(S(S(S(Z)))) (decoded into ordinary arithmetic in the sketch below).
[2] Like how an AI which already understands addition might rapidly understand 三加五等於八, 一加零等於一, 二加一等於三, 八加一等於九 (“three plus five equals eight”, “one plus zero equals one”, “two plus one equals three”, “eight plus one equals nine”) even if it has never seen Chinese before.
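For readers unfamiliar with the successor notation in footnote [1], here is a tiny decode (my own illustration, not part of the original argument), treating Z as zero and S(x) as x + 1:

```python
# Decode the successor notation from footnote [1]: Z is zero, S(x) is x + 1.
def S(x):
    return x + 1

Z = 0

assert S(S(S(Z))) + S(S(Z)) == S(S(S(S(S(Z)))))    # 3 + 2 == 5
assert S(Z) + Z == S(Z)                            # 1 + 0 == 1
assert Z + S(S(Z)) == S(S(Z))                      # 0 + 2 == 2
assert S(Z) + S(S(S(Z))) == S(S(S(S(Z))))          # 1 + 3 == 4 (the generalized case)
```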