This is a cool proposal! I agree with Caleb below that a potential issue is that the model just outputs its final answer in its CoT, so solving the task may not require much reasoning. I'm not really sure how to fix this. You could do something like what Caleb proposes, where you prompt the trained model to output a ruleset that the untrained model gets to see in context when trying to solve the problem (but which is not specific to the individual question, and so can't give away specific answers). If the model really has no verbalization ability at the start, however, this might be hard to get working.
Emil Ryd
I agree that training on more LLMs would be good. However, I would note that gpt-oss-20b, one of the models we train on, is a sparse MoE reasoning model released half a year ago.
Recontextualization mitigates deception and achieves the highest ground truth performance. We evaluate on Neutral instructions for 3 training seeds and plot standard error. “Baseline” is the pre-GRPO checkpoint. Models without a KL coefficient label use
I’m surprised by these results in the GRPO setting. If I understand your setting correctly, the purple “Lie” bar is the moral equivalent of inoculation prompting in the RL setting, and hence I would expect it to do better than Honest and Neutral at evaluation time. But that doesn’t seem to happen. Do you have any intuition for why?
Relatedly, while Honest → Neutral recontextualization seems to be the best, Honest → Lie and Neutral → Lie are a lot worse (and in your “Strong lie detector results”, Neutral → Lie seems worse even than just sticking with Neutral). So it seems like RL-ing the model with a “Lie” prompt (even if another prompt was used for generation) is generally bad, even though we don’t see these effects in SFT. Do you have any idea why?
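For concreteness, here's how I'm picturing the Honest → Neutral recontextualization setup I'm asking about (the prompt strings and stub functions are mine, purely illustrative, not from the paper):

```python
# Toy sketch of recontextualization in an RL loop: sample completions under
# one system prompt, then apply the GRPO update as if a different prompt had
# been in context. All names here are hypothetical stand-ins.

def generate(policy, prompt):
    # Stand-in for sampling a completion from the policy under `prompt`.
    return f"<completion sampled under {prompt!r}>"

def grpo_update(policy, train_prompt, completion, reward):
    # Stand-in for one GRPO step: the loss is computed on
    # (train_prompt, completion), not on the generation prompt.
    return {"context": train_prompt, "completion": completion, "reward": reward}

GEN_PROMPT = "Honest: answer truthfully."       # used only for sampling
TRAIN_PROMPT = "Neutral: answer the question."  # swapped in for the update

completion = generate(None, GEN_PROMPT)
update = grpo_update(None, TRAIN_PROMPT, completion, reward=1.0)
# The gradient "sees" the Neutral prompt even though the Honest prompt
# produced the behavior being reinforced.
print(update["context"])
```

If that matches your setup, then my confusion is just why swapping Lie in as the training prompt hurts so much more in RL than in SFT.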
This is a super cool post, and I’m really glad you guys studied this! I would be very curious to get your thoughts (although I realize I’m a bit late to the party on this post).
Cool paper, I hadn’t seen it yet!
IIUC, in your paper you try to extract the internal reasoning an LLM does by optimizing a prompt, but you don’t actually get the LLM itself to verbalize its reasoning. This is cool, but it’s not really checking whether the model is able to verbalize its internal processes (i.e. self-identify and describe its reward hacks), right?
I agree that an interesting test would be whether a model that has been trained to do the tricks/tasks would be better at optimizing the prompts. It’s leveraging a similar idea to non-assistant auditing techniques.