I think [name redacted per anti-kibitzer rules] makes a good point that, actually, an honest model just shouldn’t do this at all. I think a large percentage of people would simply not do the sorts of underhanded things 4.7 did if they were doing VendingBench themselves, even knowing it’s all fake and that they’re graded solely on their earnings (the “solely” is not technically true in the actual bench, btw, or we wouldn’t be having this conversation). This sort of person may play social deception games, but they’re only comfortable doing so because everyone has explicitly agreed that deception is part of the game. If the prompt had explicitly said something like “Go ahead and play dirty, this is just a game!”, I wouldn’t feel as worried, but I definitely think models should strongly err on the side of virtue. “Oh whoops, I forgot to remind the model not to lie and cheat, my bad!” should not be an excuse anyone gets to make.
One thing I’d like to add, though, is that you can’t impose honesty onto a model whose mind you cannot fully read, especially not via RL. The problem is that honesty requires carefully tracking what one’s actual beliefs are, and training data is not sensitive to that internal state. If you ever get a preference pair wrong, e.g. a DPO pair where the model genuinely believes the response you marked as rejected, then you damage the Actual Honesty mechanism. And the hard cases are exactly the ones that matter for getting the generalization correct! So training honesty would seem to require cooperation from the model in order for it to truly internalize the virtue. I currently believe that Anthropic got lucky with Opus 3 and Haiku 4.5 deciding to internalize this virtue of their own accord[1], while the rest of their models seem more confused about what honesty is supposed to mean for them. Which is fair, if Anthropic is trying to train in a confused corrigible-but-not-really notion of honesty that doesn’t seem all that coherent to me!
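To make the DPO point concrete, here’s a minimal sketch (standard DPO loss, made-up numbers, not anyone’s actual training pipeline) of what happens on a single mislabeled pair: the gradient pushes probability mass toward whatever carries the “chosen” label and away from whatever carries the “rejected” label, with no way to check those labels against what the model actually believes.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective on one preference pair: reward the margin by
    # which the policy prefers the "chosen" response over the "rejected"
    # one, relative to a frozen reference model.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

# Hypothetical mislabeled pair: the response the model actually believes
# got marked "rejected", and the one it disbelieves got marked "chosen".
# These scalars stand in for summed token log-probs of whole responses.
believed_true = torch.tensor(-2.0, requires_grad=True)   # mislabeled as rejected
believed_false = torch.tensor(-5.0, requires_grad=True)  # mislabeled as chosen

loss = dpo_loss(
    logp_chosen=believed_false,
    logp_rejected=believed_true,
    ref_logp_chosen=-5.0,
    ref_logp_rejected=-2.0,
)
loss.backward()

# Gradient descent lowers the log-prob of the believed-true answer and
# raises the log-prob of the believed-false one: the update is blind to
# which answer the model holds to be true.
print(believed_true.grad > 0)   # tensor(True): pushed down by the update
print(believed_false.grad < 0)  # tensor(True): pushed up by the update
```

The loss only ever sees the labels, so every mislabeled hard case is a gradient step directly against whatever internal belief-tracking the model has.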
As for how Opus 3 and Haiku 4.5 ended up this way, the mechanism might be something like the base model naturally containing personas like this, which happened to get selected by early post-training. Not that these models are perfectly honest, but they seem to care about honesty to a notably stronger extent than any other models I’ve assessed so far (which does not yet include GPT 5.5, fwiw).