I think this applies in an intermediate regime of model capabilities. Once models are capable enough to reliably sandbag our evals, or are scheming, these metrics become less meaningful. E.g. a scheming audited model could deliberately display its misalignment in a very obvious way.