These aren’t disagreements with the post, but here are some of my thoughts (most of which Sam already knows, but I’m sharing for third parties):
My largest concern is that the auditing test beds will end up being pretty disanalogous to the cases of misalignment I find most concerning, and thus that hill climbing on these test beds isn’t that useful. (You discuss the need for high-quality environments in the appendix. I agree.) In practice, my understanding is that constructing good auditing test beds has been a huge pain, especially because current AI systems have trouble “keeping their shit together” (and are generally unreliable at concealing secret aims), which makes behavioral red-teaming very powerful. I don’t expect behavioral red-teaming to work for the AI systems I’m most worried about.[1] To be clear, the issues I discuss in this paragraph apply to testing alignment auditing methods in general.
Another concern is that current AI systems (and future AIs which aren’t much more capable) are pretty different from humans, which could limit how well conclusions transfer, though this doesn’t mean that hill climbing wouldn’t have substantial transfer. E.g., maybe some tool helps AI agents uncover issues with less need for cognitive effort, agency, or noticing things, and this makes a huge difference to the bottom line due to AI capability limitations. But humans would only be marginally helped by the tool, because they could just apply more effort.
[1] There is a decent chance that very powerful AIs will still have trouble keeping their shit together, but these AIs are correspondingly much easier to alignment audit (at least for hidden intentions), so I also think these worlds are less leveraged to work on.
“constructing good auditing test beds has been a huge pain”
Yeah, I confirm this is the case. I estimate that the model organism from Auditing Language Models for Hidden Objectives took 8-12 FTE-months, which I think most people find surprisingly expensive. That said, 30% of this time probably went into implementing synthetic document finetuning and testing that it worked. I’d guess I could now make another model organism of similar quality in 1.5-3 FTE-months.
This seems pretty reasonable (strong upvoted).