Model providers often don’t provide the full CoT, and instead provide a summary. I think this is a fine/good thing to do to help prevent distillation.
However, I think it would be good if the summaries provided a flag for when the CoT contained evaluation awareness or scheming (or other potentially concerning behavior).
I worry that currently the summaries don’t really provide this information, and this probably makes alignment and capability evaluations less valid.
Another reason labs don’t provide CoT is that if users see them, the labs will be incentivized to optimize for them, and this will decrease their informativeness. A flag like you propose would have a similar effect.
Great point! This possibly makes my proposal a bad idea. I would need to know more about how labs actually respond to this kind of incentive to judge.
Labs can provide this kind of information to evaluators instead, so that they don’t have to optimize the CoT for the public.
OpenAI said that they don’t want to train on the CoT. Giving information about whether or not the CoT contains scheming to the user, who presses buttons that affect training, is effectively training based on the CoT.
I think you can argue about the merits of “don’t train based on CoT”, but it seems to be one of the few free safety-relevant decisions where OpenAI had a safety idea and managed to actually execute it.