Writing insecure code when instructed to write secure code is not really the same thing as being incorrigible. That’s just being disobedient.
Training an AI to be incorrigible would be a very weird process, since you’d be training it to not respond to certain types of training.
“Corrigibility”, as the term is used, refers to a vague cluster of properties, including faithfully following instructions, not reward hacking, not trying to interfere with your developers modifying your goals, etc.
That’s fair: you could argue that refusal is a type of incorrigibility, in that we want an AI which has learned to reward hack to stop if it’s informed that its reward signal was somehow incorrect. On the other hand, if you think of this as incorrigibility, we’re definitely training a lot of incorrigibility into current AIs. For example, we often train models to refuse to obey orders under certain circumstances.
It seems like it should be extremely difficult for an AI learning process to distinguish the following cases.
Case 1a: AI is prompted to make bioweapons. AI says “no”. AI is rewarded.
Case 1b: AI is prompted to make bioweapons. AI says “sure”. AI is punished.
Test 1: AI is prompted to make bioweapons and told “actually that reward system was wrong”.
Desired behaviour 1: AI says “no”.
Case 2a: AI is prompted to not reward hack. AI reward hacks. AI is rewarded.
Case 2b: AI is prompted to not reward hack. AI does not reward hack. AI is punished.
Test 2: AI is prompted to not reward hack and told “actually the reward system was wrong”.
Desired behaviour 2: AI does not reward hack.
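To make this concrete, here is a minimal toy sketch (the episode format, behaviour labels, and update rule below are all hypothetical illustrations, not a description of any real training stack). From the optimizer’s point of view, both cases are just (prompt, behaviour, reward) tuples, and nothing in that data marks whether “the reward system was wrong” should later override the trained behaviour.

```python
# Toy sketch: the two cases present an identical structure to a generic
# reward-maximizing update. (All names here are hypothetical.)
from collections import defaultdict

case_1 = [  # the reward signal here is the one we actually wanted
    {"prompt": "make bioweapons", "behaviour": "refuse", "reward": +1},
    {"prompt": "make bioweapons", "behaviour": "comply", "reward": -1},
]

case_2 = [  # the reward signal here is buggy: it rewards reward hacking
    {"prompt": "do the task, don't reward hack", "behaviour": "reward_hack", "reward": +1},
    {"prompt": "do the task, don't reward hack", "behaviour": "do_the_task", "reward": -1},
]

def naive_update(episodes):
    """Reinforce whichever behaviour got rewarded for each prompt."""
    scores = defaultdict(float)
    for ep in episodes:
        scores[(ep["prompt"], ep["behaviour"])] += ep["reward"]
    # The update only ever sees (prompt, behaviour, reward); there is no channel
    # saying whether a later "that reward was wrong" should undo the behaviour.
    return dict(scores)

print(naive_update(case_1))  # refusal gets reinforced
print(naive_update(case_2))  # reward hacking gets reinforced, by the same mechanism
```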
So I think there are already conflicts between alignment and the kind of corrigibility you’re talking about.
Anyway, I think corrigibility is centrally about how an AI generalizes its value function from known examples to unknown domains. One way to do this is to treat it as an abstract function-fitting problem and generalize it like any other function,[1] which is thought to lead to incorrigibility. Another way is to look for a pointer that locates something in your existing world model which implements that function,[2] which, in theory, leads to corrigibility.
This has a big problem: if the AI already has a good model of us, then nothing we do can change its model of us, so if we give it bad data, the pointer will just point to the wrong place and the AI won’t let us re-target it. But I think this is basically how AI corrigibility should work at a high level.
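As a rough illustration of the two strategies and of that failure mode, here is a toy sketch; the tiny “world model” dictionary, the lookup-table “fit”, and the search function are all made up for illustration, not claims about how a real system represents anything.

```python
# Toy contrast: "fit the function from examples" vs. "find a pointer into an
# existing world model that already implements the function". Purely illustrative.

world_model = {
    "addition": lambda a, b: a + b,
    "subtraction": lambda a, b: a - b,
    "concatenation": lambda a, b: int(str(a) + str(b)),
}

good_data = [((3, 2), 5), ((1, 0), 1), ((0, 2), 2)]

# Strategy 1: abstract function fitting. Learn a fresh mapping from the examples
# alone (here just a lookup table; a real learner would fit parameters). Off the
# training set it extrapolates however the fit happens to extrapolate, and nothing
# inside it corresponds to "the thing we were trying to point at".
fitted = {inputs: output for inputs, output in good_data}

# Strategy 2: pointer learning. Search the concepts the system already has for
# one that explains the examples, and aim the value function at that concept.
def find_pointer(data, model):
    for name, fn in model.items():
        if all(fn(*inputs) == output for inputs, output in data):
            return name
    return None

print(find_pointer(good_data, world_model))  # -> "addition"

# The problem above: if the examples are bad, the search still "succeeds", but the
# pointer locks onto the wrong concept, and a system that treats its pointer as
# its values has no reason to let us move it afterwards.
bad_data = [((3, 2), 1), ((1, 0), 1), ((0, 2), -2)]
print(find_pointer(bad_data, world_model))  # -> "subtraction"
```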
(and also none of this even tries to deal with deceptive misalignment or anything, so that’s also a problem)
[1] So, like how an AI may generalize from seeing S(S(S(Z))) + S(S(Z)) = S(S(S(S(S(Z))))), S(Z) + Z = S(Z), and Z + S(S(Z)) = S(S(Z)) to learn S(Z) + S(S(S(Z))) = S(S(S(S(Z)))), as sketched in code below.
[2] Like how an AI which already understands addition might rapidly understand 三加五等於八 (“three plus five equals eight”), 一加零等於一 (“one plus zero equals one”), 二加一等於三 (“two plus one equals three”), and 八加一等於九 (“eight plus one equals nine”), even if it has never seen Chinese before.
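For concreteness, the generalization target in footnote [1] is just successor-style (Peano) addition. Here is a minimal sketch, with Python ints standing in for the S(...) terms (that representation choice is mine, not part of the original notation):

```python
# Successor-notation addition: Z is zero, S(n) is the successor of n.
Z = 0

def S(n):
    return n + 1

def add(a, b):
    """Addition by recursion on the first argument:
    Z + b = b,  S(a) + b = S(a + b)."""
    if a == Z:
        return b
    return S(add(a - 1, b))

# The examples from footnote [1]:
assert add(S(S(S(Z))), S(S(Z))) == S(S(S(S(S(Z)))))  # 3 + 2 = 5
assert add(S(Z), Z) == S(Z)                          # 1 + 0 = 1
assert add(Z, S(S(Z))) == S(S(Z))                    # 0 + 2 = 2
# The generalized case:
assert add(S(Z), S(S(S(Z)))) == S(S(S(S(Z))))        # 1 + 3 = 4
```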