I partially agree: I think held-out tests are a great way to verify aligned cognition, especially if we never train on anything close to making that cognition look aligned. But it might also be hard to get aligned cognition in high-stakes deployments if we never train on anything close to that. My post has some unresolved tensions here: Section 3 aligns with your point, while Sections 4 and 5 offer my best current ideas for getting the benefits of training without losing monitorability. (You can just read the section titles for the gist.)
If we use targeted interventions as suggested in Section 5, especially when using multiple strong and diverse proxies as suggested in Section 4, I think we may be able to build some confidence that a) it is implausible for this training setup to teach obfuscated reasoning and b) this training setup will teach aligned reasoning/behavior/motivations that generalize at least moderately well. I think there’s quite a bit of room for advancing our understanding of how to do this well (and whether it’s possible), especially with experiments but maybe also with more conceptual arguments.