My sense is that the control techniques we’ve talked about (e.g. these, and some other stuff that we’ve thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.
For example, consider the problem “I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I’m worried that occasionally, on a subdomain of its choice, it will blatantly lie to me”. If you apply no countermeasures, it seems easy for the AI to cause this problem. But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it intuitively seems very hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that). So I think that adopting well-tuned variations of basic techniques seems really promising.
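(A minimal sketch of what this “run it twice with a rephrasing” check might look like. query_model and paraphrase_question are hypothetical placeholders rather than any real API, and the agreement threshold is purely illustrative:)

```python
def query_model(question: str) -> tuple[float, str]:
    """Placeholder for untrusted-model inference: returns (numerical answer, justification)."""
    raise NotImplementedError

def paraphrase_question(question: str) -> str:
    """Placeholder for the rephrasing/transformation step (e.g. done by a trusted weaker model)."""
    raise NotImplementedError

def consistency_check(question: str, rel_tol: float = 0.01) -> dict:
    # Ask once on the original question and once on a rephrased version.
    answer_a, justification_a = query_model(question)
    answer_b, justification_b = query_model(paraphrase_question(question))

    # To slip a lie past this check, the model has to land on the *same* wrong
    # number both times, with justifications that each hold up on their own.
    scale = max(abs(answer_a), abs(answer_b), 1e-9)
    answers_agree = abs(answer_a - answer_b) <= rel_tol * scale

    return {
        "answers": (answer_a, answer_b),
        "justifications": (justification_a, justification_b),
        "flag_for_audit": not answers_agree,
    }
```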
But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it intuitively seems very hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that).
Strong disagree with this. Probably not the most cruxy thing for us, but I’ll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.
The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle, natural mistake one could make in reasoning about the question; the model “knows” that it’s a mistake, but it just presents an argument with the subtle mistake in it.
Or, alternatively: the model “knows” what kind of argument would seem most natural to the user, and presents that kind of argument despite “knowing” that it systematically overlooks major problems.
Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there are a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but the model was also RLHF’d and got positive feedback for matching supposed experts’ answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect-human-experts example) the default outcome, even when the model internally “knows” the right answer in some sense.
Yeah, I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I’m doing some research right now that will hopefully shed some light on this!
To be clear, I am not claiming that this failure mode is very likely to be very hard to resolve. Just harder than “run it twice on the original question and a rephrasing/transformation of the question”.
The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle, natural mistake one could make in reasoning about the question; the model “knows” that it’s a mistake, but it just presents an argument with the subtle mistake in it.
[...]
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect-human-experts example) the default outcome, even when the model internally “knows” the right answer in some sense.
As in, the claim is that there is almost always a “Schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases, such that considering the adversarial case doesn’t make much difference?
(“Almost always” is because a key thing (that Buck didn’t specify) is that if you catch suspicious behavior, you can investigate and/or train. I think the question of how useful it is to catch a large number of cases where the AI does this might be an important crux.)
Supposing it is the Schelling mistake claim, I’m skeptical.
I also think that we could directly look for potential Schelling mistake areas and investigate these specific points in more detail. This might mean the AI also needs to consistently find Schelling mistakes that are non-obvious to humans, or needs to be very robust to our more targeted techniques applied to the potential problem spots.
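(As a rough sketch of what “more targeted techniques applied to the potential problem spots” could look like: decompose the model’s justification into steps, flag the steps where a Schelling mistake could plausibly hide, and re-check those steps in isolation. All of the helpers below are hypothetical placeholders, not an existing tool:)

```python
def split_into_steps(justification: str) -> list[str]:
    """Naive step splitter; a real version might use a trusted model to decompose the argument."""
    return [step.strip() for step in justification.split("\n") if step.strip()]

def is_potential_problem_spot(step: str) -> bool:
    """Placeholder heuristic for 'the kind of step where a deliberate subtle mistake could hide'."""
    raise NotImplementedError

def recheck_step_in_isolation(step: str) -> bool:
    """Placeholder for targeted follow-up: re-derive or cross-examine just this one step."""
    raise NotImplementedError

def audit_justification(justification: str) -> list[str]:
    # Narrow attention to the steps most likely to hide a Schelling mistake...
    candidates = [s for s in split_into_steps(justification) if is_potential_problem_spot(s)]
    # ...and return the ones that fail the targeted re-check, for human escalation.
    return [s for s in candidates if not recheck_step_in_isolation(s)]
```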
As in, the claim is that there is almost always a “Schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases, such that considering the adversarial case doesn’t make much difference?
The latter.