(I might be missing where you’re coming from, but worth a shot.)
The counting argument and the simplicity argument are two sides of the same coin.
The counting perspective says: {“The goal is AARDVARK”, “The goal is ABACUS”, …} is a much longer list of things than {“The goal is HELPFULNESS”}.
The simplicity perspective says: “The goal is” (with the goal slot left blank) is a shorter string than “The goal is HELPFULNESS”.
(…And then what happens with “The goal is”? Well, let’s say it reads whatever uninitialized garbage RAM bits happen to be in the slots after the word “is”, and that random garbage is its goal.)
Any sensible prior over bit-strings will wind up with some equivalence between the simplicity perspective and the counting perspective, because simpler specifications leave extra bits that can be set arbitrarily, as Evan discussed.
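To make that equivalence concrete, here’s a minimal Python sketch under a toy assumption of mine (not Evan’s): a “specification” is a prefix of a fixed-length bit-string, and the prior over full bit-strings is uniform. A k-bit specification leaves n − k bits free, so it covers 2^(n−k) strings and gets total mass 2^(−k). Shorter spec, bigger count, same number.

```python
from itertools import product

N = 12  # total specification length in bits (small, purely illustrative)

def mass_of_prefix(prefix: str, n: int = N) -> float:
    """Total mass, under a uniform prior over all n-bit strings, of the
    strings that start with `prefix` (the 'specification')."""
    matches = sum(1 for bits in product("01", repeat=n)
                  if "".join(bits).startswith(prefix))
    return matches / 2 ** n

# A short spec pins down few bits, so many strings complete it...
print(mass_of_prefix("01"))          # 0.25    == 2**-2
# ...while a long spec pins down most bits, so very few strings complete it.
print(mass_of_prefix("0110101101"))  # ~0.00098 == 2**-10
```

Either way, the mass of a specification is exactly 2^(−k): you can read that number off the spec’s length (simplicity) or off the number of completions it covers (counting).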
And so I think OP is just totally wrong that either the counting perspective or the simplicity perspective (which, again, are equivalent) predicts that NNs shouldn’t generalize, per Evan’s comment.
Your comment here says “simpler goals should be favored”, which is fine, but “pick the goal randomly at initialization” is actually a pretty simple spec. Alternatively, if we were going to pick “the one simplest possible goal” (like, I dunno, “the goal is blackness”?), then this goal would obviously be egregiously misaligned, and if I were going to casually explain to a friend why the one simplest possible goal would obviously be egregiously misaligned, I would use an (intuitive) counting argument. Right?
I think your response to John went wrong by saying “the simplest goal compatible with all the problems solved during training”. Every goal is “compatible with all the problems solved during training”, because the best way to satisfy any goal (“maximize blackness” or whatever) is to scheme and give the right answers during training.
Separately, your reply to Evan seems to bring up a different topic, which is basically that deception is more complicated than honesty, because it involves long-term planning. It’s true that “counting arguments provide evidence against NNs having an ability to do effective deception and long-term planning”, other things equal. And indeed, randomly-initialized NNs definitely don’t have that ability, and even GPT-2 didn’t, and indeed even GPT-5 has some limitations on that front. However, if we’re assuming powerful AGI, then (I claim) we’re conditioning on these kinds of abilities being present, i.e. we’re assuming that SGD (or whatever) has already overcome the prior that most policies are not effective deceivers. It doesn’t matter if X is unlikely, if we’re conditioning on X before we even start talking.
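To spell out that conditioning step (my notation, just restating the claim above): the counting argument bears on how rare capable long-horizon deceivers are among random policies, but once we assume a powerful AGI, the relevant quantity is the conditional probability

$$P(\text{schemer} \mid \text{capable}) = \frac{P(\text{schemer} \wedge \text{capable})}{P(\text{capable})}.$$

The counting prior makes both the numerator and the denominator tiny, but their ratio, which is the thing actually at issue, is left unconstrained by that rarity.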
Another possible source of confusion here is the difference between Solomonoff prior intuitions versus speed prior intuitions. Remember, Solomonoff induction does not care how long the program takes to run, just how many bits the program takes to specify. Speed prior intuitions say: there are computational limitations on the real-time behavior of the AI. And deception is more computationally expensive than honesty, because you need to separately juggle what’s true versus what story you’re spinning.
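For concreteness, one textbook way to write that contrast (standard definitions, nothing specific to this thread): a Solomonoff-style weight depends only on program length, while a speed-prior-style weight, e.g. via Levin’s $Kt$, also charges $\log_2$ of the runtime, which works out to dividing by the runtime:

$$P_{\text{Solomonoff}}(p) \propto 2^{-|p|}, \qquad Kt(p) = |p| + \log_2 t(p), \qquad P_{\text{speed}}(p) \propto 2^{-Kt(p)} = \frac{2^{-|p|}}{t(p)},$$

where $|p|$ is the program’s length in bits and $t(p)$ its runtime. The $1/t(p)$ factor is where a “juggling what’s true versus what story you’re spinning is expensive” penalty on deception would show up.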
If you wanted to say that “Counting arguments provide evidence for AI doom, but also speed-prior arguments provide evidence against AI doom”, then AFAICT you would be correct, although the latter evidence (while nonzero) might be infinitesimal. We can talk about that separately.
As for the “empirical situation” you mentioned here, I think a big contributor is the thing I said above: “Every goal is ‘compatible with all the problems solved during training’, because the best way to satisfy any goal (‘maximize blackness’ or whatever) is to scheme and give the right answers during training.” But that kind of scheming is way beyond the CoinRun RL policies, and I’m not even sure it’s such a great way to think about SOTA LLMs either (see e.g. here). I would agree with the statement “counting arguments provide negligible evidence for AI doom if we assume that situationally-aware-scheming-towards-misaligned-goals is not really on the table for other reasons”.
Thanks: my comments about “simplest goals” were implicitly assuming deep nets are more speed-prior-like than Solomonoff-like, and I should have been explicit about that. I need to think more about the point that we’re conditioning on the relevant deceptive capabilities already being present before we even start talking.