Drive-by hot take, my apologies for not having yet read the actual post:
Desired behavior in training is itself just a proxy for what we really care about: desired behavior in the high-stakes deployment cases.
So yeah, I don’t think there’s a strong principled reason to train on desired behavior but not on CoT or whatever. However, it’s very important that there be SOME held-out test, so to speak, that lets you tell what the model is really thinking. You don’t train on it, and you don’t train on anything remotely similar to it either, so that you can make a decent case that the model hasn’t learned to fool the test and that it therefore remains a good one. And I think CoT is the best candidate to play that role, though I’m open to other suggestions. (And ideally we’d have multiple held-out test sets instead of just one, tbf.)
I partially agree: I think held-out tests are a great way to verify aligned cognition, especially if we never train on anything close to making that cognition look aligned. But also, it might be hard to get aligned cognition in high-stakes deployments if we never train on anything close to that. There are some unresolved tensions here within my post: Section 3 aligns with your point, while Sections 4 and 5 offer my best current ideas for getting the benefits of training without losing monitorability. (You can just read the section titles for the gist.)
If we use targeted interventions as suggested in Section 5, especially when using multiple strong and diverse proxies as suggested in Section 4, I think we may be able to build some confidence that a) it is implausible for this training setup to teach obfuscated reasoning and b) this training setup will teach aligned reasoning/behavior/motivations that generalize at least moderately well. I think there’s quite a bit of room for advancing our understanding of how to do this well (and whether it’s possible), especially with experiments but maybe also with more conceptual arguments.