The counting argument of the form, “You don’t get what you train for, because there are many ways to perform well in training” doesn’t seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
I mean, why? Weighted by any reasonable attempt I have at operationalizing the neural network prior, you don’t get what you want out of training, because the low-loss, high-prior generalizations that do not generalize the way I want vastly outnumber the low-loss, high-prior generalizations that do (perhaps you do “get what you train for”, but I don’t really know what that phrase means).
I agree that it’s not necessary to talk about “human values” in this context. You might want to get something out of your AI systems that is closer to corrigibility, or some kind of deference, etc. However, that doesn’t fall nicely out of any analysis of the neural network prior and associated training dynamics either.
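To gesture at what I mean by the counting claim, here’s a toy sketch of my own (an overparameterized polynomial fit, which is only a loose analogy and not a model of the actual neural network prior): every sampled solution below hits zero training loss, yet almost all of them do something arbitrary off the training set.

```python
# Toy sketch (my construction, not anything formal about neural nets):
# an overparameterized polynomial fit where every sampled solution has
# zero training loss, yet almost all of them disagree off the training set.
import numpy as np

rng = np.random.default_rng(0)

# Five training points from the "intended" function f(x) = x.
x_train = np.linspace(-1.0, 1.0, 5)
y_train = x_train.copy()
degree = 15  # far more parameters than data points


def features(x):
    # Polynomial feature map: [1, x, x^2, ..., x^degree].
    return np.vander(x, degree + 1, increasing=True)


Phi = features(x_train)                       # (5, 16) design matrix
w_min = np.linalg.pinv(Phi) @ y_train         # minimum-norm exact interpolant
_, _, Vt = np.linalg.svd(Phi)
null_basis = Vt[np.linalg.matrix_rank(Phi):]  # directions invisible to training

# Sample many zero-training-loss solutions and see what they do at x = 1.5.
x_test = np.array([1.5])
preds = []
for _ in range(1000):
    w = w_min + null_basis.T @ rng.normal(scale=3.0, size=len(null_basis))
    assert np.allclose(features(x_train) @ w, y_train)  # still fits training exactly
    preds.append((features(x_test) @ w)[0])

lo, hi = np.percentile(preds, [5, 95])
print("intended value at x=1.5:", 1.5)
print(f"sampled zero-loss interpolants at x=1.5: 5th-95th percentile = ({lo:.1f}, {hi:.1f})")
```

Nothing about a 16-parameter polynomial tells you what SGD on a transformer actually does, of course; it’s just the shape of the claim that exact-fit solutions are overwhelmingly not the ones you wanted.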
The question is why that argument doesn’t rule out all the things we do successfully use deep learning for. Do image classification, or speech synthesis, or helpful assistants that speak natural language and know everything on the internet “fall nicely out of any analysis of the neural network prior and associated training dynamics”? These applications are only possible because generalization often works out in our favor. (For example, LLM assistants follow instructions that they haven’t seen before, and can even follow instructions in other languages despite the instruction-tuning data being in English.)
Again, obviously that doesn’t mean superintelligence won’t kill the humans for any number of other reasons that we’ve both read many hundreds of thousands of words about. But in order to convince people not to build it, we want to use the best, most convincing arguments, and “you don’t get what you want out of training” as a generic objection to deep learning isn’t very convincing if it proves too much.