Nice post, I think detecting hidden reasoning is an important direction!
I find it helpful in the context of this post to distinguish between (1) “hidden reasoning” in the sense that the model is deliberately confining some reasoning to forward passes and doesn’t include any traces of it in the CoT except for the conclusion, vs (2) in the steganography sense, where the tokens in the CoT are carrying important information (but in some encoded way).
IIUC, your datasets are meant to measure (1) (though the methods could in principle be applicable to both (1) and (2)), does that sound right? That could be one motivation for your approach over paraphrasing.
If you do want to study (1), maybe a more realistic proxy would be cases where current models hide reasoning in this sense (whether naturally or after prompting). I.e. pick a task where models need to do some reasoning, but it’s easy enough that they can do it in a single forward pass. Make the task such that models want to hide this reasoning (e.g. it’s about bias), or prompt them to hide it.
My guess is that this proxy of hidden reasoning would be harder to detect than the “removing reasoning chunks” proxy. And I think this is likely also true for the real case of (1)---if a scheming model was already deliberately not mentioning certain things in its CoT, it could probably also take care to make the CoT still appear more or less logical.
At 20% ablation, the monitor guessed that there were skipped steps only 31% of the time (true-positives)
Minor note: do all CoTs have at least one step removed in the 20% ablation setting? Earlier on, it sounded like you’re randomly deciding whether to remove each line of the CoT. If a CoT is short, some of them might by chance not have any lines removed, unless I’m misunderstanding the dataset creation? (Maybe this effect basically doesn’t matter though because all CoTs are much longer than 5 lines?)
Nice post, I think detecting hidden reasoning is an important direction!
I find it helpful in the context of this post to distinguish between (1) “hidden reasoning” in the sense that the model is deliberately confining some reasoning to forward passes and doesn’t include any traces of it in the CoT except for the conclusion, vs (2) in the steganography sense, where the tokens in the CoT are carrying important information (but in some encoded way).
IIUC, your datasets are meant to measure (1) (though the methods could in principle be applicable to both (1) and (2)), does that sound right? That could be one motivation for your approach over paraphrasing.
If you do want to study (1), maybe a more realistic proxy would be cases where current models hide reasoning in this sense (whether naturally or after prompting). I.e. pick a task where models need to do some reasoning, but it’s easy enough that they can do it in a single forward pass. Make the task such that models want to hide this reasoning (e.g. it’s about bias), or prompt them to hide it.
My guess is that this proxy of hidden reasoning would be harder to detect than the “removing reasoning chunks” proxy. And I think this is likely also true for the real case of (1)---if a scheming model was already deliberately not mentioning certain things in its CoT, it could probably also take care to make the CoT still appear more or less logical.
Minor note: do all CoTs have at least one step removed in the 20% ablation setting? Earlier on, it sounded like you’re randomly deciding whether to remove each line of the CoT. If a CoT is short, some of them might by chance not have any lines removed, unless I’m misunderstanding the dataset creation? (Maybe this effect basically doesn’t matter though because all CoTs are much longer than 5 lines?)