I think this applies in an intermediate regime of model capabilities. Once models are capable enough to reliably sandbag our evals, or are scheming, these metrics become less meaningful. E.g. a scheming audited model could deliberately display its misalignment in a very obvious way.