You can’t just count the number of things when some of them are much more probable than others.
I mean, you can. You just multiply the number by the relative probability of the event. Like, in finite event spaces, counting is how you do any likelihood analysis.
I am personally not a huge fan of the term “counting argument”, so I am not very attached to it. Neither are Evan or Wentworth, nor, my guess is, Joe or Eliezer, who have written about arguments in this space in the past. Of course, you always need some kind of measure. The argument is that for all measures that seem reasonable, or that we can reasonably well define, it seems like AI cares about things that are different from human values. Or, in other words, “most goals are misaligned”.
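To make the measure point concrete, here is a minimal sketch (all numbers made up, purely for illustration) of the difference between counting low-loss hypotheses and weighting them by some probability measure:

```python
# Minimal sketch (all numbers made up): counting hypotheses vs. weighting them
# by a probability measure over a finite space of low-loss generalizations.

hypotheses = {
    "intended generalization": 0.40,
    "proxy goal A": 0.15,
    "proxy goal B": 0.10,
    "degenerate goal C": 0.05,
}
other_degenerate_mass = 0.30   # spread over, say, 10_000 further degenerate goals
num_other_degenerate = 10_000

aligned_mass = hypotheses["intended generalization"]
misaligned_mass = sum(v for k, v in hypotheses.items()
                      if k != "intended generalization") + other_degenerate_mass
misaligned_count = (len(hypotheses) - 1) + num_other_degenerate

print(f"by raw count:  {misaligned_count} misaligned vs. 1 aligned")
print(f"by total mass: {misaligned_mass:.2f} misaligned vs. {aligned_mass:.2f} aligned")
# The raw count (10,003 to 1) and the weighted answer (0.60 to 0.40) are very
# different quantities; which one matters is fixed by the choice of measure.
```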
I think it would be clarifying to try to talk about something other than “human values.” On the If Anyone Builds It resource page for the frequently asked question “Why would an AI steer toward anything other than what it was trained to steer toward?”, the subheader answers, “Because there are many ways to perform well in training.” That’s the class of argument that I understand Pope and Belrose to be critiquing: that deep learning systems can’t be given goals, because all sorts of goals could behave as desired on the training distribution, and almost all of them would do something different outside the training distribution. (Note that “human values” does not appear in the question.)
But the empirical situation doesn’t seem as dire as that kind of “counting argument” suggests. In Langosco et al. 2022’s “Goal Misgeneralization in Deep Reinforcement Learning”, an RL policy trained to collect a coin that was always at the right edge of a video game level learned to go right rather than collect the coin—but randomizing the position of the coin in just a couple percent of training episodes fixed the problem.
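Concretely, the intervention amounts to something like the following (a schematic sketch, not the paper’s actual code; `make_level` and `randomize_coin_position` are stand-in names, and the rate just echoes the “couple percent” figure above):

```python
import random

COIN_RANDOMIZATION_RATE = 0.02  # the "couple percent" figure from the text

def sample_training_level(make_level, randomize_coin_position):
    """Schematic version of the training-distribution tweak described above.

    `make_level` builds a level with the coin at its default position (the
    right edge); `randomize_coin_position` moves it somewhere else. Both are
    stand-ins for whatever the actual environment code does.
    """
    level = make_level()
    if random.random() < COIN_RANDOMIZATION_RATE:
        level = randomize_coin_position(level)
    return level
```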
It’s not a priori obvious that would work! You could imagine that the policy would learn some crazy thing that happened to match the randomized examples, but didn’t generalize to coin-seeking. Indeed, there are astronomically many such crazy things—but they would seem to be more complicated than the intended generalization of coin-seeking. The counting argument of the form, “You don’t get what you train for, because there are many ways to perform well in training” doesn’t seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
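To spell that step out: under a simplicity-weighted prior (a toy calculation with made-up bit counts, not a claim about the actual neural network prior), astronomically many complicated fits can still carry far less total mass than one simple one.

```python
# Toy calculation (made-up bit counts): weight each low-loss hypothesis by a
# simplicity prior of roughly 2**(-description_length_in_bits).

simple_bits = 20       # the intended generalization: "go get the coin"
crazy_bits = 80        # each crazy fit of the randomized examples
num_crazy = 10**9      # astronomically many of them

simple_mass = 2.0 ** -simple_bits                   # ~9.5e-07
total_crazy_mass = num_crazy * 2.0 ** -crazy_bits   # ~8.3e-16

print(f"one simple hypothesis: {simple_mass:.1e}")
print(f"10^9 crazy hypotheses: {total_crazy_mass:.1e}")
# By count the crazy hypotheses win a billion to one; by simplicity-weighted
# mass the single intended hypothesis wins by about nine orders of magnitude.
```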
I mean, why? Weighted by any reasonable attempt I have at operationalizing the neural network prior, you don’t get what you want out of training, because the number of low-loss, high-prior generalizations that do not generalize the way I want vastly outweighs the number of low-loss, high-prior generalizations that do generalize the way I want (you might “get what you train for” because I don’t really know what that means).
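For instance (same toy setup as above, with equally made-up numbers): if the misaligned generalizations are only slightly more complex than the intended one but far more numerous, the weighted total comes out the other way.

```python
# Same toy setup, different (equally made-up) numbers: misaligned fits that are
# only slightly more complex than the intended one, but far more numerous.

intended_bits = 40
misaligned_bits = 45      # only five extra bits each
num_misaligned = 10**6    # but a million of them

intended_mass = 2.0 ** -intended_bits                              # ~9.1e-13
total_misaligned_mass = num_misaligned * 2.0 ** -misaligned_bits   # ~2.8e-08

print(f"intended generalization: {intended_mass:.1e}")
print(f"10^6 misaligned fits:    {total_misaligned_mass:.1e}")
# Here the misaligned side outweighs the intended one by a factor of ~30,000.
# Which way the weighted total comes out depends entirely on how the actual
# neural network prior distributes mass, which is the thing under dispute.
```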
I agree that it’s not necessary to talk about “human values” in this context. You might want to get something out of your AI systems that is closer to corrigibility, or some kind of deference, etc. However, that doesn’t fall nicely out of any analysis of the neural network prior and associated training dynamics either.
The question is why that argument doesn’t rule out all the things we do successfully use deep learning for. Do image classification, or speech synthesis, or helpful assistants that speak natural language and know everything on the internet “fall nicely out of any analysis of the neural network prior and associated training dynamics”? These applications are only possible because generalization often works out in our favor. (For example, LLM assistants follow instructions that they haven’t seen before, and can even follow instructions in other languages despite the instruction-tuning data being in English.)
Again, obviously that doesn’t mean superintelligence won’t kill the humans for any number of other reasons that we’ve both read many hundreds of thousands of words about. But in order to convince people not to build it, we want to use the best, most convincing arguments, and “you don’t get what you want out of training” as a generic objection to deep learning isn’t very convincing if it proves too much.