The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is one where there is a subtle, natural mistake one could make in reasoning about the question; the model “knows” it’s a mistake, but it just presents an argument containing that subtle mistake anyway.
[...]
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but (in some situations, like the incorrect human experts example) the default outcome, even when the model internally “knows” the right answer in some sense.
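(To make the “consistent shared wrong answers” failure mode concrete, here is a minimal toy sketch; the setup and all names are hypothetical, not anything proposed in the thread. A redundancy check compares answers across independent samples of an untrusted model: a liar with no coordination point picks arbitrary wrong answers and gets flagged, while a liar that converges on the single most natural-looking mistake passes with a consistent shared wrong answer.)

```python
import random

CORRECT = "answer_A"
SCHELLING_MISTAKE = "answer_B"  # hypothetical: the one salient wrong answer

def honest_instance():
    return CORRECT

def random_liar_instance():
    # No coordination point: each sample picks an arbitrary wrong answer.
    return random.choice(["answer_B", "answer_C", "answer_D"])

def schelling_liar_instance():
    # Every sample independently lands on the same natural-looking mistake.
    return SCHELLING_MISTAKE

def redundancy_check(instance, n=5):
    """Accept an answer only if n independent samples agree; None = flagged."""
    answers = {instance() for _ in range(n)}
    return answers.pop() if len(answers) == 1 else None

print(redundancy_check(honest_instance))          # 'answer_A' -- accepted, correct
print(redundancy_check(random_liar_instance))     # usually None -- flagged
print(redundancy_check(schelling_liar_instance))  # 'answer_B' -- accepted, wrong
```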
As in, the claim is that there is almost always a “Schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases, such that considering the adversarial case doesn’t make much difference?
(“Almost always” because a key consideration (which Buck didn’t specify) is that if you catch suspicious behavior, you can investigate and/or train against it. I think the question of how useful it is to catch a large number of cases where the AI does this might be an important crux.)
Supposing it is the Schelling mistake claim, I’m skeptical.
I also think that we could directly look for potential Schelling mistake areas and investigate these specific points in more detail. This might mean the AI also needs to consistently find Schelling mistakes which are non-obvious to humans, or needs to be very robust to our more targeted techniques applied to the potential problem spots.
As in, the claim is that there is almost always a “Schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases, such that considering the adversarial case doesn’t make much difference?
The latter.