I would have thought you’d prefer to have models learn to leverage CoT reasoning without any pressure to make it look more harmless than it would be by default.
Yeah, this is interesting. However, I think it’s unclear how HHH training (before any reasoning training) affects the faithfulness of CoT. I think it’s reasonably likely that if you apply your HHH training before reasoning training, and apply no optimization pressure on the CoT as you train reasoning, you end up with scratchpads that are fairly uncensored (since you let your HHH model thoroughly explore the use of its scratchpad without penalization). On the other hand, directly intervening on the externalized reasoning process to make it more human-friendly seems very likely to obfuscate harmful reasoning traces.
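To make the contrast concrete, here’s a minimal toy sketch of the two reward setups, under my reading of the distinction. Everything here is a hypothetical placeholder (the delimiter, `harmlessness_score`, `task_reward`), not any provider’s actual training code:

```python
# Toy sketch of two reasoning-RL reward setups. All names and scoring
# functions are hypothetical placeholders, not a real provider's pipeline.

def split_cot_and_answer(completion: str) -> tuple[str, str]:
    """Split a completion into (scratchpad, final answer) on an assumed delimiter."""
    cot, _, answer = completion.partition("</scratchpad>")
    return cot, answer

def harmlessness_score(text: str) -> float:
    """Stand-in for an HHH-style preference/safety model; returns a score in [0, 1]."""
    return 0.0 if "harmful" in text.lower() else 1.0  # toy placeholder

def task_reward(answer: str) -> float:
    """Stand-in for a task-correctness reward (e.g., a verifier or unit tests)."""
    return 1.0 if answer.strip() else 0.0  # toy placeholder

def reward_answer_only(completion: str) -> float:
    """Setup 1: no optimization pressure on the CoT.
    The harmlessness term sees only the final answer, so the scratchpad
    can stay uncensored during reasoning RL."""
    _cot, answer = split_cot_and_answer(completion)
    return task_reward(answer) + harmlessness_score(answer)

def reward_includes_cot(completion: str) -> float:
    """Setup 2: direct pressure on the externalized reasoning.
    Penalizing the scratchpad itself is the variant that plausibly
    teaches the model to obfuscate harmful reasoning traces."""
    cot, answer = split_cot_and_answer(completion)
    return task_reward(answer) + harmlessness_score(answer) + harmlessness_score(cot)
```

The only difference is whether the harmlessness gradient ever touches the scratchpad tokens; that’s the knob I’m gesturing at above.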
It also looks like providers are already applying some mix of both with cold-start SFT (as noted in the comment below), so it would be nice to know whether one is more harmful than the other wrt monitorability.