Observation: Detecting unfaithful CoT here seems to require generating many hypotheses (“cues”) about what underlying factors might influence the model’s reasoning.
Is there a less-supervised way of doing this? Some ideas:
Can we “just ask” models whether they have unfaithful CoT? This seems to be implied by introspection and related phenomena
thanks! heres my initial thought about introspection and how to improve on the setup there:
in my introspection paper we train models to predict their behavior in a single forward pass without CoT.
maybe this can be extended to this articulating cues scenario such that we train models to predict their cues as well.
still, im not totally convinced that we want the same setup as the introspection paper (predicting without CoT). it seems like an unnecessary restraint to force this kind thinking about the effect of a cue in a single forward pass. we know that models tend to do poorly on multiple steps of thinking in a forward pass. so why handicap ourselves?
my current thinking is that it is more effective for models to generate hypotheses explicitly and then reason about what effects their reasoning afterwards. maybe we can train models to be more calibrated about what hypotheses to generate when they carry out their CoT. seems ok.
Observation: Detecting unfaithful CoT here seems to require generating many hypotheses (“cues”) about what underlying factors might influence the model’s reasoning.
Is there a less-supervised way of doing this? Some ideas:
Can we “just ask” models whether they have unfaithful CoT? This seems to be implied by introspection and related phenomena
Can we use black-box lie detection to detect unfaithful CoT?
thanks! heres my initial thought about introspection and how to improve on the setup there:
in my introspection paper we train models to predict their behavior in a single forward pass without CoT.
maybe this can be extended to this articulating cues scenario such that we train models to predict their cues as well.
still, im not totally convinced that we want the same setup as the introspection paper (predicting without CoT). it seems like an unnecessary restraint to force this kind thinking about the effect of a cue in a single forward pass. we know that models tend to do poorly on multiple steps of thinking in a forward pass. so why handicap ourselves?
my current thinking is that it is more effective for models to generate hypotheses explicitly and then reason about what effects their reasoning afterwards. maybe we can train models to be more calibrated about what hypotheses to generate when they carry out their CoT. seems ok.