Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you’re testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)
Is that actually true though? Why is that true? Say we are training the model on a dataset of N human answers, and then we are going to deploy it to answer 10N more questions, all from the same big pool of questions. The AI can’t tell whether it is in training or deployment, but it could decide to follow a policy of giving some sort of catastrophic answer with probability 1/(10N) on each question, so that it’ll probably make it through training just fine and then still get to cause catastrophe.
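To put rough numbers on that policy, here is a minimal sketch. It assumes the model decides independently on each question whether to defect, with the same probability 1/(10N) in training and deployment; the function name and the choice N = 1000 are just illustrative.

```python
import math

def defection_odds(n_train: int, deploy_multiplier: int = 10) -> tuple[float, float]:
    """Return (P(no defection during training), P(at least one defection in deployment)).

    Assumes independent per-question defection with probability 1/(deploy_multiplier * n_train).
    """
    n_deploy = deploy_multiplier * n_train
    p = 1.0 / n_deploy  # per-question defection probability, 1/(10N) in the example
    p_pass_training = (1.0 - p) ** n_train
    p_catastrophe_in_deployment = 1.0 - (1.0 - p) ** n_deploy
    return p_pass_training, p_catastrophe_in_deployment

p_pass, p_cat = defection_odds(n_train=1000)
print(f"P(clean training run)          = {p_pass:.3f}")  # close to exp(-0.1)
print(f"P(catastrophe in deployment)   = {p_cat:.3f}")   # close to 1 - exp(-1)
```

Under those assumptions the model gets through training cleanly about 90% of the time, and still defects at least once in deployment about 63% of the time, more or less independent of N.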
That’s right—you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here’s a post reviewing my best guesses).
But before, you weren’t even in the game: it wouldn’t have mattered how well adversarial training worked, because you didn’t even have the knowledge to tell whether a given behavior was good or bad. You weren’t even getting the right behavior on average.
(In the OP I think the claim “the generalization is now coming entirely from human beliefs” is fine; I meant generalization from one distribution to another. “Neural nets are fine” was sweeping these issues under the rug. Though note that in the real world the distribution will change from neural net training to deployment; that’s just exactly the normal robustness problem. The point of this post is just to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training; the reason to set it up as in the OP was to make the issue more clear.)
I agree with Daniel. Certainly training on actual iid samples from the deployment distribution helps a lot—as it ensures that your limiting behavior is correct—but in the finite data regime you can still find a deceptive model that defects some percentage of the time.
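One rough way to quantify “some percentage of the time”: if a model defects independently with rate eps on each iid sample, it passes N training samples with probability (1 - eps)^N. The sketch below solves for the largest eps that still passes with probability at least delta. The independence assumption, the function name, and the threshold delta = 0.05 are illustrative choices, not anything from the post.

```python
def max_undetected_defection_rate(n_train: int, delta: float = 0.05) -> float:
    """Largest per-sample defection rate eps with (1 - eps)**n_train >= delta.

    Assumes the model defects independently on each iid training sample.
    """
    if n_train <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need n_train > 0 and 0 < delta < 1")
    return 1.0 - delta ** (1.0 / n_train)

for n in (100, 1_000, 10_000):
    eps = max_undetected_defection_rate(n)
    print(f"N = {n:>6}: a defection rate up to {eps:.5f} survives training with probability 0.05")
```

With delta = 0.05 this comes out to roughly 3/N, so the rate you can rule out shrinks as N grows but never reaches zero with finite data.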
This is a good question, and I don’t know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.
Yeah, that’s my view.
Thanks for confirming.