This is valuable work! I'm curious what you see as the most promising environments for extending the training-against-LLM-monitors setup. As you note, this should be done in more realistic reward hacking settings, but you also mention that the models' ability to reward hack in the setting from Baker et al. doesn't depend on how they use their CoT. Are you aware of any existing environments that satisfy both desiderata, i.e., that are realistic and also sufficiently complex that LLMs must rely on their CoT? Aside from the Baker et al. environment, which environments would you be most excited to see follow-ups to your work use?