Thanks, this makes it pretty clear to me how alignment could be fundamentally hard even setting aside deception. (The problem seems to hold even if your values are actually pretty simple; e.g., if you're a pure hedonistic utilitarian and you've magically solved deception, you can still fail at outer alignment via your AI optimizing for making it look like there's more happiness and less suffering.)
Some (perhaps basic) notes to check that I’ve understood this properly:
The Bayes net running example per se isn’t really necessary for ELK to be a problem.
The basic problem is that in training, the AI can do just as well by reporting what a human would believe given their observations; then, upon deployment on more complex tasks, the report of what a human would believe can come apart from the "truth" (what the human would believe given arbitrarily detailed knowledge of the system).
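To check my understanding with a toy example (the vault/diamond setting is from the report, but the code and names below are my own hypothetical gloss): both reporters get identical training loss wherever human observations track the truth, so gradient descent has no reason to prefer the one we want.

```python
# Toy illustration (hypothetical, not from the ELK report): two reporters
# that agree wherever humans can check, and so get identical training loss.

def direct_translator(world_state):
    # Reports what is actually true in the predictor's model of the world.
    return world_state["diamond_in_vault"]

def human_simulator(observations):
    # Reports what a human would conclude from the available observations.
    return observations["camera_shows_diamond"]

# On the training distribution, observations faithfully reflect the world,
# so the two reporters are indistinguishable:
train_state = {"diamond_in_vault": True}
train_obs = {"camera_shows_diamond": True}
assert direct_translator(train_state) == human_simulator(train_obs)

# Off-distribution (e.g., a tampered camera), they come apart -- and
# training gave us no signal to prefer the direct translator:
test_state = {"diamond_in_vault": False}
test_obs = {"camera_shows_diamond": True}  # sensor tampering
assert direct_translator(test_state) != human_simulator(test_obs)
```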
This seems to crop up for a variety of models of AI and human cognition.
It seems like the game is stacked against "doing X" and in favor of "making it look like X" in many contexts, such that even with regularizers that push towards the former, the overall inductive bias would plausibly still be towards the latter. It's just easier to make it look to humans like you're creating a utopia than to do all the complex work of utopia-building.
I suspect this would hold even for much less ambitious yet still superhuman tasks, such that deferring to future human-level aligned AIs wouldn’t be sufficient.
But if we train a separate reporter module, reporting what the human would believe doesn't seem prima facie easier than reporting the truth in this way. That's why we might reasonably hope a good regularizer can break the tie.
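A minimal sketch of that tie-breaking hope, with entirely made-up numbers and a hypothetical penalty term (the report considers several candidate regularizers, e.g., complexity or computation time):

```python
# Minimal sketch (all names and numbers hypothetical): if both reporters
# fit the training data equally well, only the regularizer breaks the tie.

def regularized_loss(qa_loss, penalty, weight=0.1):
    """QA loss on human-checkable questions plus a complexity/speed penalty."""
    return qa_loss + weight * penalty

qa_loss = 0.02  # identical for both reporters, by hypothesis

# The hope: simulating a whole human is more complex/slower than
# translating the predictor's own latent state.
loss_translator = regularized_loss(qa_loss, penalty=1.0)
loss_simulator = regularized_loss(qa_loss, penalty=5.0)
assert loss_translator < loss_simulator  # regularizer breaks the tie

# The worry from the report's counterexamples: for a smart enough predictor
# with an alien ontology, the penalties can flip (the human becomes cheap to
# simulate relative to translation), and the tie breaks the wrong way.
```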
In the build-break loop examples in the report, we're generously assuming the human overseers know the relevant set of questions to ask to check whether there's malfeasance going on, and that this set isn't so hopelessly large that iterating through it during training is too slow.
In the imitative generalization example, it seems like, besides the problem that the output Bayes net may be ontologically incomprehensible to humans, the training process also requires humans to understand all the relevant hypotheses and data (in order to report their priors and likelihoods). This may be a general confusion about imitative generalization on my part.
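Here's roughly how I'm picturing the imitative generalization loop (a hypothetical interface, not the actual proposal's implementation; the `human_prior`, `human_likelihood`, and `human_predict` callables stand in for real human judgments, which is exactly where my worry bites):

```python
# Rough sketch of imitative generalization as I understand it.

def imitative_generalization(hypotheses, train_data, test_input,
                             human_prior, human_likelihood, human_predict):
    # Score each human-legible hypothesis z by the human's (log) prior on z
    # plus the human's (log) likelihood of the training data under z. Both
    # steps require humans to actually understand z and the data.
    best = max(hypotheses,
               key=lambda z: human_prior(z)
                   + sum(human_likelihood(d, z) for d in train_data))
    # Humans then use the winning hypothesis to answer/predict on new inputs.
    return human_predict(best, test_input)
```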
If we tried distillation to get around the prohibitive slowness of amplification for the "AI science" proposal, that would introduce inner alignment problems and perhaps bring us back to the same sort of "alien ontology" problem as in the imitative generalization proposal.
The ontology mismatch problem isn't just a possibility; it seems pretty likely by default, for reasons summarized in the plot of model interpretability here.
Intuitively, the ontology/primitive concepts that quantum physicists use to make excellent predictions about the universe (better predictions than I could make, certainly) are alien to me, and to anyone else who hasn't spent a lot of time learning quantum physics. This is consistent with human-interpretable concepts being more prevalent in recent powerful language models than in early-2010s neural networks.
Deferring to future human-level aligned AIs isn't sufficient because even if we had many more human-level minds giving feedback to superhuman AIs, they would still face ELK themselves. That is, this doesn't seem to be a problem that can be solved just by parallelizing across more overseers than we currently have, although having aligned assistants could of course still help with ELK research.