James Chua comments on Harmless reward hacks can generalize to misalignment in LLMs

James Chua 29 Aug 2025 17:16 UTC
2 points
We test on GPT-4.1, which I think was frontier-ish, at least as of 4 months ago.
I agree with the principle of testing more models. I’m most interested in RL environments!