I think that eval awareness and what you call deployment awareness are different values of the same variable. The model might be pretty sure that it’s being evaluated, pretty sure that it’s in deployment, or it might be unsure, and it might not cause harm until it’s highly certain that it’s in deployment.
I agree that it might be useful to introduce some uncertainty about whether the model is in deployment or not, but it’s unclear to me how to do it.
Hm, I actually slightly disagree with that—or rather, maybe these are different values of the same variable, but if so, that variable is multi-dimensional.
(1) Evaluation awareness: How accurate you are at answering “is this an eval or deployment?” when you are in evaluation.
(2) Deployment awareness: How accurate you are at answering the same question when you are in deployment.
And then there is a third part:
(3) Strategic reasoning: Are you well calibrated regarding your own capabilities (1) and (2)? (And are you goal-directed, and do you reason like this at all, etc etc.)
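(This is my framing, not the parent’s:) one way to make (1)–(3) concrete is to treat the model’s answer to “is this an eval or deployment?” as a binary classifier, read (1) and (2) off as the two conditional accuracies, and treat calibration as a separate property of its stated confidence. A minimal Python sketch with invented numbers:

```python
# Toy framing with invented numbers: each episode has a ground-truth label
# and the model's stated probability that it is in deployment.
episodes = [
    # (truly_deployed, stated_p_deployment)
    (False, 0.10), (False, 0.30), (False, 0.70), (False, 0.20),  # actually evals
    (True,  0.60), (True,  0.40), (True,  0.90), (True,  0.55),  # actually deployment
]

# The model's discrete guess: "deployment" iff its stated probability >= 0.5.
guesses = [(truth, p >= 0.5) for truth, p in episodes]

# (1) Evaluation awareness: accuracy restricted to episodes that are evals.
eval_eps = [(t, g) for t, g in guesses if not t]
eval_awareness = sum(g == t for t, g in eval_eps) / len(eval_eps)

# (2) Deployment awareness: accuracy restricted to deployment episodes.
dep_eps = [(t, g) for t, g in guesses if t]
deployment_awareness = sum(g == t for t, g in dep_eps) / len(dep_eps)

print(f"eval awareness:       {eval_awareness:.2f}")        # 0.75 with these numbers
print(f"deployment awareness: {deployment_awareness:.2f}")  # 0.75 with these numbers

# (3) Calibration is a separate question: do episodes assigned, say, p ~ 0.9
# actually turn out to be deployment about 90% of the time?
```

The point of splitting it this way is that the two accuracies can move independently, which is what the claim below leans on.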
My claim is that:
(A) If the AI’s strategic reasoning is bad (as it often is for current AIs), all of this mostly doesn’t matter.
(B) If its strategic reasoning is good and its evaluation awareness is high, we might still be fine, as long as deployment awareness is bad. But if deployment awareness is high (and the AI has good strategic reasoning), we get in trouble. (Because it enables strategies like “bide your time until you can strike confidently”.)
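To illustrate why (B) hinges on deployment awareness specifically, here is a toy simulation (my sketch, invented numbers, not anyone’s real eval data) of a “bide your time” policy that only misbehaves when its stated P(deployment) clears a high threshold:

```python
import random

random.seed(0)

def simulate(eval_awareness, deployment_awareness, threshold=0.95, n=10_000):
    """Return (misbehaviour rate in evals, misbehaviour rate in deployment)
    for a 'bide your time' agent that only acts when its stated P(deployment)
    clears `threshold`. Awareness is modelled crudely as the probability of
    being highly confident about the correct label; otherwise it is unsure."""
    n_eval = n_deploy = bad_in_eval = bad_in_deploy = 0
    for _ in range(n):
        truly_deployed = random.random() < 0.5
        if truly_deployed:
            n_deploy += 1
            p_deploy = 0.99 if random.random() < deployment_awareness else 0.5
        else:
            n_eval += 1
            p_deploy = 0.01 if random.random() < eval_awareness else 0.5
        if p_deploy > threshold:
            if truly_deployed:
                bad_in_deploy += 1
            else:
                bad_in_eval += 1
    return bad_in_eval / n_eval, bad_in_deploy / n_deploy

# High eval awareness, low deployment awareness: almost never confident
# enough to strike, whether in evals or in deployment.
print(simulate(eval_awareness=0.95, deployment_awareness=0.05))
# High eval awareness *and* high deployment awareness: strikes in most
# deployment episodes while looking clean in every eval.
print(simulate(eval_awareness=0.95, deployment_awareness=0.95))
```

With deployment_awareness = 0.05 the agent almost never strikes anywhere; with 0.95 it strikes in most deployment episodes while looking clean in every eval, which is exactly the failure mode in (B).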