This is very concerning, and consistent with other patterns I've noted across studies of various kinds of misaligned model behavior: reasoning-trained models appear to be not only more capable of such behavior but also more willing to engage in it. This suggests that successfully aligning reasoning-trained models is a harder problem. I suspect we'll need an alignment approach that can be interleaved with the reasoning training itself.