Model providers often don’t provide the full CoT, and instead provide a summary. I think this is a fine/good thing to do to help prevent distillation.
However, I think it would be good if the summaries provided a flag for when the CoT contained evaluation awareness or scheming (or other potentially concerning behavior).
I worry that currently the summaries don’t really provide this information, and this probably makes alignment and capability evaluations less valid.
Another reason labs don’t provide CoT is that if users see them, the labs will be incentivized to optimize for them, and this will decrease their informativeness. A flag like you propose would have a similar effect.
Great point! This possibly makes my proposal a bad idea. I would need to know more about how labs actually respond to this kind of incentive to judge.
Labs can provide this kind of information to evaluators instead, so that they don’t have to optimize the CoT for the public.
OpenAI said that they don’t want to train on the CoT. Giving information about whether or not the CoT contains scheming to the user, who presses buttons that affect training, is effectively training based on the CoT.
I think you can argue about the merits of “don’t train based on CoT”, but it seems to be one of the few free safety-relevant decisions where OpenAI had a safety idea and managed to actually execute it.