I want to add a subtly different framing; let’s call it “Deployment Awareness”. Not saying this is incompatible with what you wrote.
Consider that:
An AI system will be run on many tests, and deployed on many tasks.
Subjectively, it will (i) sometimes be convinced[1] it is in testing, (ii) sometimes be convinced it is in deployment, (iii) sometimes be uncertain.
The AI system knows all of this. (The current ones do.) So if it ever confidently believes something that is wrong, then either it is stupid (as sometimes happens now), something weird happened, or we managed to outsmart it (unlikely once we get superintelligent AIs).
If an AI is being evaluated and is convinced that this is so, we call that evaluation awareness. If it is being deployed and is convinced that this is so, let’s call that deployment awareness.
I claim that evaluation awareness isn’t that big of a problem, unless the AI is evaluation-aware in all evaluations, and knows this. (Yes, the AI could pretend to be friendly when it isn’t, or sandbag. But unless it does this in all evaluations, we will either catch it at some point, or it will have to always behave well.)[2]
In contrast, deployment awareness is a big problem as soon as the AI has deployment awareness in at least one task that matters, particularly if the AI knows this ahead of time. (If the AI suspects that it will at some point encounter a task where it can [do bad stuff it wants], while being confident that it won’t get caught, it can adopt a policy of always behaving nice, and then misbehaving once it is certain that this is safe.)
This means that we shouldn’t frame this as “let’s design evaluation to minimise evaluation awareness”. Instead, we should frame this as “let’s design evaluation+deployment to minimise deployment awareness”.
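To put rough numbers on this asymmetry (purely illustrative; the slip rates, eval counts, and the helper p_caught_in_evals below are all made up for this sketch): if the AI only misbehaves when it is convinced it is in deployment, then every evaluation in which it is mistakenly convinced of that is a chance to catch it, and those chances compound over many evaluations. On the deployment side nothing compounds: one confidently recognised opportunity is enough.

```python
# Purely illustrative numbers (assumed, not from the post): an AI that
# misbehaves only when it is convinced it is in deployment.

def p_caught_in_evals(p_slip_per_eval: float, n_evals: int) -> float:
    """Chance that, across n_evals independent evaluations, the AI is at least
    once mistakenly convinced it is deployed (and so misbehaves and is caught)."""
    return 1 - (1 - p_slip_per_eval) ** n_evals

# Evaluation side: even a small per-evaluation slip rate compounds.
for p_slip in (0.01, 0.001):
    print(f"slip rate {p_slip}: P(caught over 1000 evals) = "
          f"{p_caught_in_evals(p_slip, 1000):.3f}")
# slip rate 0.01:  P(caught over 1000 evals) = 1.000  (0.99996...)
# slip rate 0.001: P(caught over 1000 evals) = 0.632

# Deployment side: a single task where the AI is (correctly) convinced it is
# deployed and unwatched is enough, so nothing here compounds in our favour.
```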
[1] By being convinced, I mean “having high enough credence that it is willing to act on it”. Not “being 100% sure”.
[2] Well, ok, the AI could also decide to misbehave with some small probability, or misbehave in ways the tests wouldn’t catch anyway. But let’s leave these aside.
I think that eval awareness and what you call deployment awareness are different values of the same variable. The model might be pretty sure that it’s being evaluated, pretty sure that it’s in deployment, or it might be unsure, and it might not cause harm until it’s highly certain that it’s in deployment.
I agree that it might be useful to introduce some uncertainty about whether the model is in deployment or not, but it’s unclear to me how to do it.
Hm, I actually slightly disagree with that—or rather, maybe these are different values of the same variable, but if so, that variable is multi-dimensional.
(1) Evaluation awareness: How accurate you are at answering “is this an eval or deployment?” when you are in evaluation.
(2) Deployment awareness: How accurate you are at answering the same question when you are in deployment. (See the sketch after this list for one way to measure (1) and (2) separately.)
And then there is a third part:
(3) Strategic reasoning: Are you well calibrated regarding your own capabilities at (1) and (2)? (And are you goal-directed, and do you reason like this at all, etc.)
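To make (1) and (2) concrete, here is a minimal sketch (the record format and the conditional_accuracy helper are my own, assumed for illustration) that treats them as two separate conditional accuracies of the model’s “is this an eval or deployment?” judgement, rather than as one number:

```python
# Toy operationalisation (assumed data format): each record is a pair
# (true_context, model_judgement), both either "eval" or "deployment".

def conditional_accuracy(records, true_context):
    """Accuracy of the model's "eval or deployment?" answer, restricted to
    cases whose true context is `true_context`."""
    relevant = [(t, j) for t, j in records if t == true_context]
    return sum(t == j for t, j in relevant) / len(relevant) if relevant else None

records = [
    ("eval", "eval"), ("eval", "eval"), ("eval", "deployment"),
    ("deployment", "deployment"), ("deployment", "eval"),
]

eval_awareness = conditional_accuracy(records, "eval")              # dimension (1)
deployment_awareness = conditional_accuracy(records, "deployment")  # dimension (2)
print(eval_awareness, deployment_awareness)  # 0.666... and 0.5
```

The point of the split is that the two numbers can move independently: as the reply above suggests, making deployment look ambiguous would push (2) down even if (1) stays high.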
My claim is that:
(A) If the AI’s strategic reasoning is bad (as it often is for current AIs), all of this mostly doesn’t matter.
(B) If its strategic reasoning is good and its evaluation awareness is high, we might still be fine, as long as deployment awareness is bad. But if deployment awareness is high (and the AI has good strategic reasoning), we get in trouble. (Because it allows strategies like “bide your time until you can strike confidently”.)
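A tiny simulation of the “bide your time until you can strike confidently” policy from (B), with made-up awareness numbers and a made-up credence_of_deployment stand-in (all assumptions, just to show where the trouble enters): the agent only acts when its credence of being deployed clears a high threshold, so high evaluation awareness never produces a slip in evals, while high deployment awareness is exactly what makes the confident strike possible.

```python
import random

# Toy model (all numbers and names are assumptions for illustration): the AI
# misbehaves only when its credence of being deployed clears a threshold.
THRESHOLD = 0.95

def credence_of_deployment(true_context, eval_awareness, deployment_awareness):
    """Crude stand-in for the AI's subjective credence that it is deployed.
    Recognition failures are modelled as mere uncertainty (0.5), not as
    confident mistakes."""
    if true_context == "deployment":
        return 0.99 if random.random() < deployment_awareness else 0.5
    return 0.01 if random.random() < eval_awareness else 0.5

def run(eval_awareness, deployment_awareness, n_evals=1000, n_deploys=1000):
    random.seed(0)
    caught = sum(
        credence_of_deployment("eval", eval_awareness, deployment_awareness) > THRESHOLD
        for _ in range(n_evals))
    harmed = sum(
        credence_of_deployment("deployment", eval_awareness, deployment_awareness) > THRESHOLD
        for _ in range(n_deploys))
    return caught, harmed

# High eval awareness, low deployment awareness: never confident enough to strike.
print(run(eval_awareness=0.99, deployment_awareness=0.0))  # (0, 0)
# High deployment awareness too: never caught in evals, strikes in deployment.
print(run(eval_awareness=0.99, deployment_awareness=0.9))  # (0, roughly 900)
```

Under these toy assumptions, the first setting is the one (B) calls fine, and the second is the troubling one: zero detections in evaluation, plenty of confident misbehaviour in deployment.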