[Question] Generalization and the Multiple Stage Fallacy?
Doomimir: The possibility of AGI being developed gradually doesn’t obviate the problem of the “first critical try”: the vast hypermajority of AGIs that seem aligned in the “Before” regime when they’re weaker than humans, will still want to kill the humans “After” they’re stronger and the misalignment can no longer be “corrected”. The speed of the transition between those regimes doesn’t matter. The problem still exists and is still fatal whether it takes a day or a decade.
Simplicia: I agree that the risk you describe is real, but I don’t understand why you’re so sure the risk is high. As we’ve discussed previously, the surprising fact that deep learning works at all comes down to generalization. In principle, an astronomical number of functions are compatible with the training data, the astronomical supermajority of which do something crazy or useless on non-training inputs. But the network doesn’t have a uniform prior on all possible functions compatible with the training data; there’s a bias towards simple functions that “generalize well” in some sense.
There’s definitely a risk of goal misgeneralization, where we were mistaken about how behavior in the Before regime generalizes to behavior in the After regime. But if we work hard to test and iterate on our AI’s behavior in the settings where we can observe and correct it, isn’t there hope of it generalizing to behave well once we can no longer correct it? In analogy, it’s not inhumanly hard to design and build machines on land that successfully generalize to functioning on an airplane or in a submarine.
Doomimir: Or in space, or inside the sun? Suppose that two dozen things change between Before and After. Even if you anticipate and try to devise solutions for three-quarters of them, not all of your solutions are going to work on the first critical try, and then there are the problems you failed to anticipate. This isn’t the kind of thing human beings can pull off in real life.
Simplicia: Sorry, this is probably a stupid question, but isn’t that reasoning similar to the multiple stage fallacy that you’ve derided elsewhere?
That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people’s reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.
As an illustrative example, suppose that the “correct” probability of some proposition Q is 0.9. Someone who wants to argue that Q is unlikely represents it as a conjunction of two dozen sub-propositions: Q is true if and only if Q_1 is true, and Q_2 is true given that Q_1 is true, and Q_3 is true given that Q_1 and Q_2 are true, and so on up to Q_24.
Someone who assumed the sub-propositions Q_i were independent and assigned them each a probability of 0.95 would only assign Q a probability of 0.95^24 ≈ 0.29. Indeed, with the assumption of independence of the sub-propositions Q_i, one would need to assign each Q_i an intuitively “extreme”-looking probability of exp(log(0.9)/24) ≈ 0.996 in order to assign Q the correct probability of 0.9. Which should be a clue that the Q_i aren’t really independent, that that choice of decomposition into sub-propositions was a poor one (with respect to the goal of getting the right answer, as contrasted to the goal of tricking respondents into assigning a low probability to Q).
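To spell out that arithmetic, here is a minimal Python sketch, assuming nothing beyond the numbers in the example above (the variable names are only illustrative):

```python
import math

n_stages = 24
p_correct = 0.9      # the "correct" probability of Q in the example
p_per_stage = 0.95   # an innocuous-looking probability for each Q_i

# Treating the Q_i as independent drives the conjunction far below 0.9:
p_conjunction = p_per_stage ** n_stages
print(f"0.95^24 = {p_conjunction:.2f}")        # ~0.29

# To recover P(Q) = 0.9 under the independence assumption, each Q_i would
# need an "extreme"-looking probability of 0.9^(1/24):
p_needed = math.exp(math.log(p_correct) / n_stages)
print(f"exp(log(0.9)/24) = {p_needed:.3f}")    # ~0.996

# In the opposite extreme, if the Q_i are perfectly correlated (they all
# stand or fall together), the conjunction is just P(Q_1), so nothing about
# the decomposition itself forces the answer below 0.9.
```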
So when you posit that two dozen things change between a detectable/correctable-failures regime and a fatal regime, such that the conjunctive probability of not hitting a fatal misgeneralization is tiny, how do I know you’re not committing a multiple stage fallacy? Why is that a “correct”, non-misleading decomposition into sub-propositions?
In analogy, someone who wanted to argue that it’s infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an “ought”/steering problem, not an “is”/prediction problem! One-shotting all of those separate problems isn’t something that human beings can do in real life, the argument would go. But of course, the problems aren’t independent, and text-to-image generators do exist.
Is there a version of your argument that doesn’t depend on the equivalent of, “Suppose there are twenty-four independent things that can go wrong, surely you don’t want to bet the world on them each succeeding with probability 0.996”?
Doomimir: You’re right: that is a stupid question.
Simplicia: [head down in shame] I know. It’s only … [straightening up] I would like to know the answer, though. [turning to the audience] Do you know?
In analogy, someone who wanted to argue that it’s infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an “ought”/steering problem, not an “is”/prediction problem! One-shotting all of those separate problems isn’t something that human beings can do in real life, the argument would go. But of course, the problems aren’t independent, and text-to-image generators do exist.
Isn’t part of the deal here that we didn’t one-shot image generation, though?
The first image generators were crazy, and we slowly iterated on them. And image generation is “easy” because, unlike superintelligence or even self-driving cars or regular ol’ production code, nothing particularly bad happens if a given image is bad.
That said, FYI I was kind of enlightened by this phrasing:
That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people’s reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.
I’d been feeling sus about why the multiple stage fallacy was even a fallacy at all, apart from “somehow in practice people fuck it up.” Multiplying probabilities together is… like, how else are you supposed to do any kind of sophisticated reasoning?
But, “because people are scared of (or bad at) assigning extreme probabilities” feels like it explains it to me.
There are two different issues with “the first critical try” (the After regime), where misalignment is lethal. First, maybe alignment is sufficiently solved by the time you enter After, and that’s why it doesn’t kill you. But second, maybe After never arrives at all.
Gradualist arguments press both issues, not just alignment in the After regime. Sufficient control makes increasingly capable AIs non-lethal even if misaligned, which means that an AI that would bring about the After regime today wouldn’t do so in a future where better countermeasures (which are not about alignment) are in place. Which is to say, this particular AI won’t enter the After regime yet, because the world is sufficiently different that its capabilities are no longer sufficient for lethality; an even more capable AI would be needed for that.
This is different from an ASI Pause delaying the After regime until ASI-grade alignment is solved, because the level of capabilities that counts as After keeps changing. Instead of delaying ASI at a fixed level of capabilities until alignment is solved, After is being pushed into the future by increasing levels of control that make increasingly capable AIs non-critical. As a result, After never happens at all, instead of only happening once alignment at a relevant level is sufficiently solved.
(Of course the feasibility of ASI-grade control is as flimsy as the feasibility of ASI-grade alignment when working on a capabilities schedule without an AGI/ASI Pause, to say nothing of gradual disempowerment in that gradualist regime. But the argument is substantially different: a proponent of gradualist development of ASI-grade control might feel that there is no fixed After, and maybe that After never actually arrives even as capabilities keep increasing. The arguments against the feasibility of gradualist development of ASI-grade alignment, on the other hand, seem to posit a fixed After whose arrival remains inevitable at some point, which doesn’t engage with the framing of gradualist arguments about the development of ASI-grade control.)