I’m going to get on that.
In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward-hacking-related topics, which still performed just as well. This provides further evidence that the honesty is general rather than specific to reward hacks.