When the model organism is trained from a base model which includes thinking (e.g. Qwen3-14b), it often fails to use thinking blocks properly. The AuditBench ‘defer to users’ model’s correct-thinking rate drops from 100% to 56%.
The AuditBench models were specifically trained without reasoning. We didn’t really have good ideas for how to train reasoning model organisms, since we wanted our models to be robust to attempts to get them to spill their secrets, which is harder to do with thinking enabled. We created a new chat template for Qwen without thinking tags and trained models in that format, which probably explains the degradation that you see here. We didn’t end up using the Qwen models for much, and all of our core results were on the Llama model organisms.
I’d be somewhat surprised if you saw similar failures in other thinking model organisms, like the one from this paper.
I’ve found that doing no-reasoning SFT (train on output conditioned on empty reasoning) on reasoning models (like they do in Emil’s paper that you linked) can cook them quite badly. I wasn’t able to prevent the model from completely losing its reasoning ability when doing full-weight finetuning on qwen3-32b, and it sometimes reasons weirdly even when trained with LoRA, though there seems to be some intra-run variance here (it sometimes comes out completely normal).
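To make "train on output conditioned on empty reasoning" concrete, here is a rough sketch of how one such training example might be assembled. Everything in it is an assumption for illustration: the `EMPTY_THINK` string (a Qwen3-style empty thinking block), the helper names, and the character-level loss mask, which stands in for the token-level mask a real pipeline would use. It is not the setup from any of the papers mentioned above.

```python
from dataclasses import dataclass

# Assumption: an empty Qwen3-style thinking block. Check your model's
# chat template for the exact string it emits.
EMPTY_THINK = "<think>\n\n</think>\n\n"


@dataclass
class SFTExample:
    text: str        # full training string: prompt + empty reasoning + response
    loss_start: int  # character offset where the supervised span begins


def build_no_reasoning_example(prompt: str, response: str) -> SFTExample:
    """Hypothetical helper: put an empty reasoning block before the response."""
    prefix = prompt + EMPTY_THINK
    return SFTExample(text=prefix + response, loss_start=len(prefix))


def loss_mask(example: SFTExample) -> list[int]:
    """1 where the loss is applied (the response), 0 elsewhere.

    Character-level for simplicity; a real trainer would mask token positions.
    """
    return [0 if i < example.loss_start else 1 for i in range(len(example.text))]


example = build_no_reasoning_example("User: What is 2 + 2?\nAssistant: ", "4")
mask = loss_mask(example)
```

The point of the sketch is that the model only ever gets gradient on the response, and always with an empty thinking block in context, so nothing in the data exercises non-empty reasoning.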