I agree about the distinction, and that both invariance and unawareness are worth pursuing to get more out of evaluation and auditing.
However, I am slightly more positive towards unawareness and slightly less positive towards invariance.
When I imagine an AI powerful enough to cause a catastrophe, it is because it has a goal about the state of the world that is misaligned with human goals. I have a hard time imagining such a goal being invariant to whether the model is being evaluated, since evaluation could lead to the model being retrained. It seems to me that in order to solve eval invariance for sufficiently intelligent and agentic systems, we would have to solve most of what makes alignment difficult in the first place.
Regarding unawareness, I agree that it won’t scale to ASI and that eval awareness for current evals requires only human-level intelligence. However, I believe there is a lot of headroom to make our evals more realistic. For instance, if we have prompts from deployments that led to concerning behavior for previous models, we could subject new models to a deployment distribution of prompts and manually inspect the ones that previously led to misaligned behavior. The only fundamental difference between deployment and evals is that deployment will happen in the future. A sufficiently intelligent model could exploit this by, for example, waiting to see a factorisation of RSA-2048: producing one is currently infeasible, so observing one is strong evidence that the model is in the real future rather than in a test.
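To make the replay idea concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `deployment_prompts` and `flagged_prompts` lists, the `model` callable); it just shows the shape of the procedure: sample an eval set from real deployment traffic, guarantee coverage of prompts that previously elicited concerning behavior, and route those to manual review.

```python
import random

def build_eval_set(deployment_prompts, flagged_prompts, n=1000, seed=0):
    """Sample an eval set from the deployment distribution, making sure
    every previously flagged prompt is included for manual inspection."""
    rng = random.Random(seed)
    sample = rng.sample(deployment_prompts, k=min(n, len(deployment_prompts)))
    # Union keeps the eval distribution close to deployment while
    # guaranteeing coverage of known trouble spots (dedup, order-preserving).
    return list(dict.fromkeys(sample + flagged_prompts))

def run_eval(model, deployment_prompts, flagged_prompts):
    """Run the new model on the replayed deployment prompts.

    `model` is any prompt -> text callable; prompts that caused trouble
    for a previous model are marked for human review.
    """
    eval_set = build_eval_set(deployment_prompts, flagged_prompts)
    flagged_set = set(flagged_prompts)
    results = []
    for prompt in eval_set:
        response = model(prompt)
        results.append({
            "prompt": prompt,
            "response": response,
            "needs_manual_review": prompt in flagged_set,
        })
    return results
```

The point is that the eval distribution here is literally the deployment distribution; the only residual difference is that it is replayed rather than live, which is exactly the gap the RSA-2048 example exploits.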
Since neither approach would work on sufficiently intelligent and agentic systems, the question is which scales furthest.
It seems worth pointing out that humans may be eval-unaware, but we don’t want to be. The simulation hypothesis and the existence of an afterlife are things that humans do care about. If we could find out for sure whether we’re in a simulation, most people would want to know. We do care about whether the world we live in is just a morality test to gain access to the Good Place.
Humans aren’t eval-invariant either. Humans who believe in a morality-segregated afterlife don’t tend to be more moral than humans who believe death is the end, but they do differ in some observable behaviors. (A cynical atheist might say they are more sycophantic.)