There’s a lot I don’t like about this post (trying to do away with the principle of indifference or goals is terrible greedy reductionism), but the core point, that goal counting arguments of the form “many goals could perform well on the training set, so you’ll probably get the wrong one” seem to falsely imply that neural networks shouldn’t generalize (because many functions could perform well on the training set), seems so important and underappreciated that I might have to give it a very grudging +9. (Update: downgraded to +1 in light of Steven Byrnes’s comment.)
In the comments, Evan Hubinger and John Wentworth argue for a corrected goal-counting argument, but in both cases, I don’t get it: it seems to me that simpler goals should be favored, rather than choosing randomly from a large space of equally probable goals. (This doesn’t solve the alignment problem, because the simplest generalization need not be the one we want, per “List of Lethalities” #20.)
I am painfully aware that the problem might be on my end. (I’m saying “I don’t get it”, not “This is wrong.”) Could someone help me out here? What does the correct goal-counting argument look like?
both counting arguments involve an inference from “there are ‘more’ networks with property X” to “networks are likely to have property X.”

This is still correct, even though the “models should overfit” conclusion is false, because simpler functions have more networks (parameterizations) corresponding to them.
(I might be missing where you’re coming from, but worth a shot.)
The counting argument and the simplicity argument are two sides of the same coin.
The counting perspective says: {“The goal is AARDVARK”, “The goal is ABACUS”, …} is a much longer list of things than {“The goal is HELPFULNESS”}.
The simplicity perspective says: “The goal is” is a shorter string than “The goal is helpfulness”.
(…And then what happens with “The goal is”? Well, let’s say it reads whatever uninitialized garbage RAM bits happen to be in the slots after the word “is”, and that random garbage is its goal.)
Any sensible prior over bit-strings will wind up with some equivalence between the simplicity perspective and the counting perspective, because simpler specifications leave extra bits that can be set arbitrarily, as Evan discussed.
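As a minimal sketch of that equivalence (a toy setup with illustrative numbers of my own, not anything from Evan’s comment): under a uniform prior over length-n bit strings, a goal spec that pins down only the first k bits “owns” all 2^(n-k) completions, and counting those completions recovers exactly the 2^(-k) simplicity weight.

```python
# Toy equivalence of the counting and simplicity perspectives, assuming a
# uniform prior over all length-n bit strings (illustrative, not from the
# thread): fixing the first k bits leaves n - k bits free to be set
# arbitrarily, so counting the free completions recovers the 2**-k weight.
n = 10

def prior_mass_of_prefix(k: int) -> float:
    """Total prior mass of all strings whose first k bits are fixed."""
    completions = 2 ** (n - k)  # counting perspective: the arbitrary bits
    per_string = 2.0 ** -n      # uniform prior over full strings
    return completions * per_string

assert prior_mass_of_prefix(3) == 2.0 ** -3  # short spec, many completions
assert prior_mass_of_prefix(9) == 2.0 ** -9  # long spec, few completions
```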
And so I think the OP is just totally wrong that either the counting perspective or the simplicity perspective (which, again, are equivalent) predicts that NNs shouldn’t generalize, per Evan’s comment.
Your comment here says “simpler goals should be favored”, which is fine, but “pick the goal randomly at initialization” is actually a pretty simple spec. Alternatively, if we were going to pick “the one simplest possible goal” (like, I dunno, “the goal is blackness”?), then this goal would obviously be egregiously misaligned, and if I were going to casually explain to a friend why the one simplest possible goal would obviously be egregiously misaligned, I would use an (intuitive) counting argument. Right?
I think your response to John went wrong by saying “the simplest goal compatible with all the problems solved during training”. Every goal is “compatible with all the problems solved during training”, because the best way to satisfy any goal (“maximize blackness” or whatever) is to scheme and give the right answers during training.
Separately, your reply to Evan seems to bring up a different topic, which is basically that deception is more complicated than honesty, because it involves long-term planning. It’s true that “counting arguments provide evidence against NNs having an ability to do effective deception and long-term planning”, other things equal. And indeed, randomly-initialized NNs definitely don’t have that ability, and even GPT-2 didn’t, and indeed even GPT-5 has some limitations on that front. However, if we’re assuming powerful AGI, then (I claim) we’re conditioning on these kinds of abilities being present, i.e. we’re assuming that SGD (or whatever) has already overcome the prior that most policies are not effective deceivers. It doesn’t matter if X is unlikely, if we’re conditioning on X before we even start talking.
Another possible source of confusion here is the difference between Solomonoff prior intuitions versus speed prior intuitions. Remember, Solomonoff induction does not care how long the program takes to run, just how many bits the program takes to specify. Speed prior intuitions say: there are computational limitations on the real-time behavior of the AI. And deception is more computationally expensive than honesty, because you need to separately juggle what’s true versus what story you’re spinning.
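One rough way to write the contrast down (a simplified sketch: U is a universal machine, t(p) is the runtime of program p, and the runtime-penalized form is the commonly cited simplification of Schmidhuber’s speed prior rather than its exact definition):

```latex
% Solomonoff prior: weight programs by description length alone.
\[ M(x) \;\propto\; \sum_{p \,:\, U(p) = x} 2^{-|p|} \]
% Speed prior (simplified): additionally discount by runtime t(p).
\[ S(x) \;\propto\; \sum_{p \,:\, U(p) = x} \frac{2^{-|p|}}{t(p)} \]
```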
If you wanted to say that “Counting arguments provide evidence for AI doom, but also speed-prior arguments provide evidence against AI doom”, then AFAICT you would be correct, although the latter evidence (while nonzero) might be infinitesimal. We can talk about that separately.
As for the “empirical situation” you mentioned here, I think a big contributor is the thing I said above: “Every goal is ‘compatible with all the problems solved during training’, because the best way to satisfy any goal (‘maximize blackness’ or whatever) is to scheme and give the right answers during training.” But that kind of scheming is way beyond the CoinRun RL policies, and I’m not even sure it’s such a great way to think about SOTA LLMs either (see e.g. here). I would agree with the statement “counting arguments provide negligible evidence for AI doom if we assume that situationally-aware-scheming-towards-misaligned-goals is not really on the table for other reasons”.
Thanks: my comments about “simplest goals” were implicitly assuming deep nets are more speed-prior-like than Solomonoff-like, and I should have been explicit about that. I need to think more about your point that effective deception is already present, already conditioned on, before we even start talking.
I don’t really know what you are confused by. It seems like you get it. Weigh your goals or goal-generating functions by complexity (or better, try to form any coherent model of what the inductive biases of deep learning training might be and weigh by that). Now you have a counting argument. The counting argument predicts all the things usual counting arguments predict. These arguments also predict that these algorithms generalize (indeed, as Evan says, the arguments would be much weaker if the algorithms didn’t generalize, because you need to invoke simplicity or some kind of distribution over inductive biases to get a well-defined object at all).
Now you have a counting argument. The counting argument predicts all the things usual counting arguments predict.
I guess I’m not sure what you mean by “counting argument.” I understand the phrase to refer to inferences of the form, “Almost all things are wrong, so if you pick a thing, it’s almost certainly going to be wrong.” For example, most lottery tickets aren’t winners, therefore your ticket won’t win the lottery.
But the counting works in the lottery example because there’s an implied uniform prior: every ticket has the same probability as any other. If we’re weighing things by simplicity, what work is being done by counting, as such?
Suppose a biased coin is flipped 1000 times, and it comes up 600 Heads and 400 Tails. If someone made a counting argument of the form, “Almost all guesses at what the coin’s bias could be are wrong, so if you guess, you’re almost certainly going to be wrong”, that would be wrong: by the asymptotic equipartition property, I can actually be very confident that the bias is close to 0.6 in favor of Heads. You can’t just count the number of things when some of them are much more probable than others. (And if any correct probabilistic argument is a “counting argument”, that’s a really confusing choice of terminology.)
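(As a quick check of the coin example, assuming a uniform Beta(1, 1) prior over the bias; the snippet is just an illustration, not anything from the thread:)

```python
# Posterior over the coin's bias after 600 Heads / 400 Tails, assuming a
# uniform Beta(1, 1) prior; the conjugate update gives Beta(601, 401).
from scipy.stats import beta

heads, tails = 600, 400
posterior = beta(1 + heads, 1 + tails)

print(posterior.mean())          # ~0.600: the posterior concentrates near 0.6
print(posterior.interval(0.95))  # ~(0.569, 0.630): almost all mass near 0.6
```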
You can’t just count the number of things when some of them are much more probable than others.
I mean, you can. You just multiply the number by the relative probability of the event. Like, in finite event spaces, counting is how you do any likelihood analysis.
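Here is what that looks like in a toy finite hypothesis space (the class names and numbers below are made-up assumptions for illustration, not anything from this discussion):

```python
# Weighted counting: multiply the number of hypotheses in each class by the
# per-hypothesis weight, instead of counting every hypothesis uniformly.
# All class names and numbers are illustrative assumptions.
classes = {
    "simple, generalizes as intended":    (1,       2.0 ** -10),
    "simple, generalizes some other way": (100,     2.0 ** -12),
    "complex, fits training by luck":     (10 ** 9, 2.0 ** -60),
}

total = sum(count * weight for count, weight in classes.values())
for name, (count, weight) in classes.items():
    print(f"{name}: {count * weight / total:.8f}")  # posterior mass of class
```

With these (made-up) weights, the astronomically large class contributes almost nothing, and changing the weights flips the verdict, which is why everything turns on the measure.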
I am personally not a huge fan of the term “counting argument”, so I am not very attached to it. Neither are Evan or Wentworth, nor, I’d guess, Joe or Eliezer, who have written about arguments in this space in the past. Of course, you always need some kind of measure. The argument is that for all measures that seem reasonable or that we can reasonably well-define, it seems like AI cares about things that are different than human values. Or in other words “most goals are misaligned”.
The argument is that for all measures that seem reasonable or that we can reasonably well-define, it seems like AI cares about things that are different than human values. Or in other words “most goals are misaligned”.
I think it would be clarifying to try to talk about something other than “human values.” On the If Anyone Builds It resource page for the frequently asked question “Why would an AI steer toward anything other than what it was trained to steer toward?”, the subheader answers, “Because there are many ways to perform well in training.” That’s the class of argument that I understand Pope and Belrose to be critiquing: that deep learning systems can’t be given goals, because all sorts of goals could behave as desired on the training distribution, and almost all of them would do something different outside the training distribution. (Note that “human values” does not appear in the question.)
But the empirical situation doesn’t seem as dire as that kind of “counting argument” suggests. In Langosco et al. 2022’s “Goal Misgeneralization in Deep Reinforcement Learning”, an RL policy trained to collect a coin that was always at the right edge of a video game level learned to go right rather than collect the coin—but randomizing the position of the coin in just a couple percent of training episodes fixed the problem.
It’s not a priori obvious that would work! You could imagine that the policy would learn some crazy thing that happened to match the randomized examples, but didn’t generalize to coin-seeking. Indeed, there are astronomically many such crazy things—but they would seem to be more complicated than the intended generalization of coin-seeking. The counting argument of the form, “You don’t get what you train for, because there are many ways to perform well in training” doesn’t seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
The counting argument of the form, “You don’t get what you train for, because there are many ways to perform well in training” doesn’t seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
I mean, why? Weighted by any reasonable attempt I have at operationalizing the neural network prior, you don’t get what you want out of training, because the number of low-loss, high-prior generalizations that do not generalize the way I want vastly outweighs the number of low-loss, high-prior generalizations that do generalize the way I want (you might “get what you train for”, for all I know, since I don’t really know what that phrase means).
I agree that it’s not necessary to talk about “human values” in this context. You might want to get something out of your AI systems that is closer to corrigibility, or some kind of deference, etc. However, that also doesn’t fall nicely out of any analysis of the neural network prior and associated training dynamics either.
The question is why that argument doesn’t rule out all the things we do successfully use deep learning for. Do image classification, or speech synthesis, or helpful assistants that speak natural language and know everything on the internet “fall nicely out of any analysis of the neural network prior and associated training dynamics”? These applications are only possible because generalization often works out in our favor. (For example, LLM assistants follow instructions that they haven’t seen before, and can even follow instructions in other languages despite the instruction-tuning data being in English.)
Again, obviously that doesn’t mean superintelligence won’t kill the humans for any number of other reasons that we’ve both read many hundreds of thousands of words about. But in order to convince people not to build it, we want to use the best, most convincing arguments, and “you don’t get what you want out of training” as a generic objection to deep learning isn’t very convincing if it proves too much.