Florian_Dietz comments on Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz 20 Jan 2026 21:35 UTC
1 point
0
I’m going to get on that.

In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward hacking related topics, which still performed just as well. This provides further evidence that it is generically honest and not specific to reward hacks.