This is a cool proposal! I agree with Caleb below that a potential issue is that the model just outputs its final answer in its CoT, so solving the task may not require much reasoning. I'm not really sure how to fix this. You could do something like what Caleb proposes, where you prompt the trained model to output a ruleset that the untrained model gets to see in context when trying to solve the problem (but which is not specific to the individual question, and so can't give away specific answers). If the model really has no verbalization ability at the start, however, this might be hard to get working.
Emil Ryd
I agree that training on more LLMs would be good. However, I would note that gpt-oss-20b, one of the models we train on, is a sparse MoE reasoning model released half a year ago.
Recontextualization mitigates deception and achieves the highest ground truth performance. We evaluate on Neutral instructions for 3 training seeds and plot standard error. “Baseline” is the pre-GRPO checkpoint. Models without a KL coefficient label use
I’m surprised by these results in the GRPO setting. If I understand your setting correctly, the purple “Lie” bar is the moral equivalent of inoculation prompting in the RL setting, and hence I would expect it to do better than Honest and Neutral at evaluation time. But that doesn’t seem to happen. Do you have any intuition for why?
Relatedly, while Honest → Neutral recontextualization seems to be the best, Honest → Lie and Neutral → Lie are a lot worse (and in your “Strong lie detector results”, Neutral → Lie seems worse even than just sticking with Neutral). So it seems like RL-ing the model with a “Lie” prompt (even if another prompt was used for generation) is generally bad, even though we don’t see these effects in SFT. Do you have any idea why?
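For concreteness, here's how I'm picturing the Honest → Neutral recontextualization setup I'm asking about (the prompt strings and stub functions are mine, purely illustrative, not from the paper):

```python
# Toy sketch of recontextualization in an RL loop: sample completions under
# one system prompt, then apply the GRPO update as if a different prompt had
# been in context. All names here are hypothetical stand-ins.

def generate(policy, prompt):
    # Stand-in for sampling a completion from the policy under `prompt`.
    return f"<completion sampled under {prompt!r}>"

def grpo_update(policy, train_prompt, completion, reward):
    # Stand-in for one GRPO step: the loss is computed on
    # (train_prompt, completion), not on the generation prompt.
    return {"context": train_prompt, "completion": completion, "reward": reward}

GEN_PROMPT = "Honest: answer truthfully."       # used only for sampling
TRAIN_PROMPT = "Neutral: answer the question."  # swapped in for the update

completion = generate(None, GEN_PROMPT)
update = grpo_update(None, TRAIN_PROMPT, completion, reward=1.0)
# The gradient "sees" the Neutral prompt even though the Honest prompt
# produced the behavior being reinforced.
print(update["context"])
```

If that matches your setup, then my confusion is just why swapping Lie in as the training prompt hurts so much more in RL than in SFT.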
This is a super cool post, and I’m really glad you guys studied this! I would be very curious to get your thoughts (although I realize I’m a bit late to the party on this post).
Cool paper, I hadn’t seen it yet!
IIUC, in your paper you try to extract the internal reasoning an LLM does by optimizing a prompt, but you don’t actually get the LLM itself to verbalize its reasoning. This is cool, but it’s not really checking whether the model is able to verbalize its internal processes (i.e. self-identify and describe its reward hacks), right?
I agree that an interesting test would be whether a model that has been trained to do the tricks/tasks would be better at optimizing the prompts. It’s leveraging a similar idea to non-assistant auditing techniques.