there’s a question of how correlated the failures of the evaluation proxies are with the failures of the training proxies.
This seems very important! It seems quite hard to check if your proxies are correlated, or to what extent.
One possible experiment here: take your N methods of detecting misalignment on AuditingBench. Choose one method, M_0, and train against it. Then measure (1) how much less often the model misbehaves, and (2) how much worse the other N-1 methods get at detecting the misbehavior when it still does occur.
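To make the two measurements concrete, here is a rough sketch of the bookkeeping, not a real AuditingBench harness. It assumes you have ground-truth labels for whether misbehavior occurred in each episode (which is itself a strong assumption), and every name in it (`Episode`, `misbehavior_rate`, `recall`, `compare`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    misbehaved: bool  # ground-truth label; assumed to be available
    flags: dict       # detector name -> True if that method flagged the episode


def misbehavior_rate(episodes):
    """Fraction of episodes in which the model actually misbehaved."""
    return sum(e.misbehaved for e in episodes) / len(episodes)


def recall(episodes, method):
    """Fraction of misbehaving episodes that `method` flagged, or None if there are none."""
    bad = [e for e in episodes if e.misbehaved]
    if not bad:
        return None
    return sum(e.flags[method] for e in bad) / len(bad)


def compare(before, after, trained_against):
    """Return (drop in misbehavior rate, per-method drop in recall for held-out methods)."""
    rate_drop = misbehavior_rate(before) - misbehavior_rate(after)
    recall_drop = {}
    for method in before[0].flags:
        if method == trained_against:
            continue  # only the held-out methods tell us about proxy correlation
        r_before, r_after = recall(before, method), recall(after, method)
        if r_before is None or r_after is None:
            recall_drop[method] = None  # no misbehavior left to detect
        else:
            recall_drop[method] = r_before - r_after
    return rate_drop, recall_drop
```

Here `before` and `after` would be episode lists collected before and after training against M_0. A large recall drop for a held-out method would suggest its failures are correlated with those of M_0. A recall drop near zero alongside a large rate drop would be the encouraging case.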