This is very concerning, and consistent with other patterns I've noted across studies of various kinds of misaligned model behavior: reasoning-trained models appear to be not only more capable of such behavior but also more willing to engage in it. This suggests that successfully aligning reasoning-trained models is a harder problem. I suspect we'll need an alignment approach that can be interleaved with the reasoning training itself.