A more general point: all types of monitoring are much more difficult if you don’t know what you are looking for. Prompted-LLM (action) monitors get much better if you give them few-shot examples of the behaviors in question. If you have a dataset of true positives, you can use it to fine-tune your LLM or fit a probe.
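For the probe case, here is a minimal sketch of what "fit a probe" means, assuming you have activation vectors for a labeled set of true positives and negatives. The activations here are synthetic stand-ins (the `fake_activations` helper and the planted "attack direction" are made up for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                        # hypothetical activation dimension
attack_dir = rng.normal(size=d)  # pretend direction separating attack traces

def fake_activations(n, attack):
    """Stand-in for real model activations on monitored trajectories."""
    base = rng.normal(size=(n, d))
    return base + (1.5 * attack_dir if attack else 0.0)

# Small labeled dataset: true positives (attacks) vs. benign negatives.
X = np.vstack([fake_activations(100, True), fake_activations(100, False)])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Logistic-regression probe fit by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))       # predicted attack probability
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.5 * np.mean(p - y)                # gradient step on bias

acc = np.mean(((1 / (1 + np.exp(-(X @ w + b)))) > 0.5) == y)
print(f"probe train accuracy: {acc:.2f}")
```

In practice you would replace the synthetic vectors with cached residual-stream activations and hold out a test split; the point is just that a linear probe needs only a modest labeled dataset of true positives to get going.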
Of course, if you are doing AI control you might have a very large attack surface where your untrusted agents could plausibly pwn you! You likely do need your monitor to generalize to unseen types of attack, and this does in fact make monitoring difficult!
My current take is that it’s good to evaluate monitors on a broad category of true positives, but you should make sure that (1) your monitor is designed to generalize across that category and (2) all of your true positives are clearly in that category.
Yes, I quite agree the next step is to try fine-tuning a monitor model. We did try quite a lot of prompting strategies, of which we only show a few.
We would’ve done it in this paper, but only had 3 months in the LASR fellowship to complete it. Even two more weeks would have been enough to do some monitor training experiments!
We also ran up against budget constraints on compute, so big thanks to LASR for the support on this!
This is just a first step in exploring the space of CoT monitoring. I wouldn’t recommend anybody get either optimistic or pessimistic at this point, just curious to ask the next obvious questions!