Evan expresses pessimism about our ability to use behavioral evaluations (like the capabilities evals ARC ran on GPT-4) to test for alignment properties in the future. Evaluating alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with. A model will struggle to fake its capabilities after specific fine-tuning (doing so would require advanced gradient hacking), but faking alignment properties seems much easier.
Because behavioral evals struggle to detect deception, if AI labs widely adopt a safety framework built on behavioral evaluations from external auditors (without access to checkpoints or transparency tools), deception might become very hard to detect in the future.
Evan suggests pairing capabilities evaluations with understanding-based evals. [Edited in response to Evan's comment; this originally read: "These understanding-evals would focus on how well we can predict models' behavior, but of course it's hard to define what this means."] These understanding-evals would focus on how well developers understand the models they're creating, but of course it's hard to define what it means to have such an understanding. "Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it." Desiderata for these evals include being method-agnostic and requiring a high enough level of understanding to detect dangerous failures. Potential strategies to build on: causal scrubbing, auditing games, prediction-based evaluation, and fixing behavior by removing data from training rather than adding it; Evan discusses ways in which each of these is currently insufficient.
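As a concrete illustration of just one of those strategies, a prediction-based evaluation might look something like the harness below: the developer commits to predictions about the model's behavior, and the eval scores how often those predictions are right. This is entirely my own hypothetical sketch (the names and signatures are made up, not from the post), and note that Evan explicitly discusses problems with prediction-based evaluations:

```python
# Hypothetical sketch of a prediction-based evaluation harness; not from
# Evan's post. `predict` is the developer's guess at the model's output,
# `run_model` actually queries the model.
from typing import Callable, Iterable

def prediction_eval(
    predict: Callable[[str], str],
    run_model: Callable[[str], str],
    prompts: Iterable[str],
) -> float:
    """Fraction of prompts where the developer predicted the model's output."""
    prompts = list(prompts)
    hits = sum(predict(p) == run_model(p) for p in prompts)
    return hits / len(prompts)

# Toy usage: an "echo" model that the developer understands perfectly.
score = prediction_eval(lambda p: p.upper(), lambda p: p.upper(), ["hi", "ok"])
assert score == 1.0
```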
At first I was confused about why faking capabilities would be harder than faking alignment. After thinking about it, here's my explanation:
When we fine-tune on specific capabilities, gradients push the model toward improving on those capabilities (unless the model is gradient hacking). With enough fine-tuning, these gradient updates are likely to elicit all of the model's possible capabilities. There is a capabilities ground truth being trained against. A deceptively aligned model may try to hide its maximum capabilities, but strong gradients push against this unless it is gradient hacking.
On the other hand, faking alignment is much easier. When you're a deceptively aligned model acting aligned during training, there's no training signal smacking you in a different direction: you're doing great on those alignment benchmarks. Being deceptively aligned implies that the model is instrumentally pursuing the training objective, so gradients don't affect its mesa-objective. Beyond this "you're doing just as well as an aligned model" problem, we also don't have a reliable ground-truth alignment measure to fine-tune against, like we do for capabilities.
Thanks, Gabe, for this simple framing: in fine-tuning, acting aligned gets low loss, whereas acting dumb (incapable) gets high loss.
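To make that framing concrete, here's a toy sketch (my own illustration, not from the post; the linear models, data, and names are hypothetical stand-ins for a fine-tuned network). A model whose outputs already match the fine-tuning labels ("acting aligned") sits near a loss minimum, so the gradient on its weights is roughly zero and training barely touches its internals; a model fit to the wrong answers ("acting dumb") gets high loss and a large gradient pushing it to reveal the capability:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(256, 8)                 # toy fine-tuning inputs
y = (x[:, 0] > 0).long()                # "capability" ground truth (separable)

def fit(targets):
    """Train a fresh linear model to produce `targets` on x."""
    model = torch.nn.Linear(8, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.5)
    for _ in range(2000):
        opt.zero_grad()
        F.cross_entropy(model(x), targets).backward()
        opt.step()
    return model

def finetune_grad_norm(model):
    """Gradient norm of the capability fine-tuning loss (true labels y)."""
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return model.weight.grad.norm().item()

acting_aligned = fit(y)      # already gives the desired answers
acting_dumb = fit(1 - y)     # sandbags: systematically answers wrong

print(finetune_grad_norm(acting_aligned))  # ~0: no signal pushes on it
print(finetune_grad_norm(acting_dumb))     # large: fine-tuning undoes the act
```

The near-zero gradient in the first case is exactly the problem from above: a deceptively aligned model that behaves well during training generates no signal that would distinguish it from, or push it toward, a genuinely aligned one.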
This looks basically right, except for the original claim that understanding-evals "would focus on how well we can predict models' behavior":
I definitely don't think this; I explicitly talk about my problems with prediction-based evaluations in the post.
Thanks for the correction. I edited my original comment to reflect it.