Here’s my attempted phrasing, which I think avoids some of the common confusions:
Suppose we have a model M with utility function ϕ, where M is not capable of taking over the world. Assume that thanks to a bunch of alignment work, ϕ is within δ (by some metric) of humanity’s collective utility function. Then in the process of maximizing ϕ, M ends up doing a bunch of vaguely helpful stuff.
Then someone releases model M′ with utility function ϕ′, where M′ is capable of taking over the world. Suppose that our alignment techniques generalize perfectly. That is, ϕ′ is also within δ′ of humanity’s collective utility function, where δ′ ≤ δ. Then in the process of maximizing ϕ′, M′ gets rid of humans and rearranges their molecules to satisfy ϕ′ better.
Does this phrasing seem accurate and helpful?
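As a toy illustration of the setup (everything here is made up for the picture: the one-dimensional state, the “cliff” in the true utility, and measuring δ as average disagreement on ordinary states), something like:

```python
import random

random.seed(0)

def true_utility(x):
    """Humanity's 'real' preferences in this toy: more is better up to a
    point, then pushing further becomes catastrophic."""
    return x if x <= 1.0 else 2.0 - x

def proxy_utility(x):
    """The learned objective phi: agrees with true utility on ordinary
    states, but keeps rewarding 'more' past the cliff."""
    return x

# delta measured (here) as average disagreement on ordinary, training-like states
ordinary = [random.uniform(0.0, 1.0) for _ in range(10_000)]
delta = sum(abs(proxy_utility(x) - true_utility(x)) for x in ordinary) / len(ordinary)
print(f"delta on ordinary states: {delta:.3f}")  # 0.000 -- looks aligned

def optimize(search_radius, n_samples):
    """Cartoon optimizer: sample candidate actions, pick the proxy-best one."""
    candidates = [random.uniform(0.0, search_radius) for _ in range(n_samples)]
    return max(candidates, key=proxy_utility)

weak_choice = optimize(search_radius=1.2, n_samples=10)          # M: limited reach
strong_choice = optimize(search_radius=100.0, n_samples=10_000)  # M': much more reach

print(f"M  picks x = {weak_choice:.2f},  true utility {true_utility(weak_choice):+.2f}")
print(f"M' picks x = {strong_choice:.2f}, true utility {true_utility(strong_choice):+.2f}")
```

The cartoon’s only point is that δ can look tiny on the states M actually visits while still allowing arbitrarily bad outcomes once a stronger optimizer pushes into states where ϕ′ and the true utility come apart.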
This is an excellent encapsulation of (I think) something different—the “fragility of value” issue: “formerly adequate levels of alignment can become inadequate when applied to a takeover-capable agent.” I think the “generalization gap” issue is “those perfectly-generalizing alignment techniques must generalize perfectly on the first try”.
Attempting to deconfuse myself about how that works if it’s “continuous” (someone has probably written the thing that would deconfuse me, but as an exercise): if AI power progress is “continuous” (which training is, but model-sequence isn’t), it goes from “you definitely don’t have to get it right at all to survive” to “you definitely get only one try to get it sufficiently right, if you want to survive,” but by what path? In which of the terms “definitely,” “one,” and “sufficiently” is it moving continuously, if any?
I certainly don’t think it’s via the number of tries you get to survive! I struggle to imagine an AI where we all die if we fail to align it three times in a row.
I don’t put any stock in “sufficiently,” either—I don’t believe in a takeover-capable AI that’s aligned enough to not work toward takeover, but which would work toward takeover if it were even more capable. (And even if one existed, it would have to eschew RSI and other instrumentally convergent things, else it would just count as a takeover-causing AI.)
It might be via the confidence of the statement. Now, I don’t expect AIs to launch highly-contingent outright takeover attempts; if they’re smart enough to have a reasonable chance of succeeding, I think they’ll be self-aware enough to bide their time, suppress the development of rival AIs, and do instrumentally convergent stuff while seeming friendly. But there is some level of self-knowledge at which an AI will start down the path toward takeover (e.g., extricating itself, sabotaging rivals) and succeed with a probability that’s very much neither 0 nor 1. Is this first, weakish, self-aware AI able to extricate itself? It depends! But I still expect the relevant band of AI capabilities here to be pretty narrow, and we get no guarantee it will exist at all. And we might skip over it with a fancy new model (if it was sufficiently immobilized during training or guarded its goals well).
Of course, there’s still a continuity in expectation: when training each more powerful model, it has some probability of being The Big One. But yeah, I more or less predict a Big One; I believe in an essential discontinuity arising here from a continuous process. The best analogy I can think of is how every exponential with r<1 dies out and every r>1 goes off to infinity. Once you allow dynamical systems, you naturally get cuspy behavior.
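A few-line sketch of that analogy (the iterated map, step count, and cap are arbitrary choices, just to make the cusp visible):

```python
def long_run(r, start=1.0, steps=20_000, cap=1e6):
    """Iterate x -> r * x and report where it ends up (capped so it prints)."""
    x = start
    for _ in range(steps):
        x = min(r * x, cap)
    return x

for r in [0.90, 0.99, 0.999, 1.001, 1.01, 1.10]:
    print(f"r = {r:5.3f} -> long-run level {long_run(r):10.3g}")
# Everything below 1 withers to ~0; everything above 1 slams into the cap.
# The parameter moved smoothly; the long-run outcome is effectively binary.
```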
Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following:[2]
1. Because of the way AI systems are trained, δ,δ′ will be large even if we knew humanity’s collective utility function and could target that (this is inner misalignment)
2. Even if δ′ were fairly small, this would still result in catastrophic outcomes if M′ is an extremely powerful optimizer (this is fragility of value)
A few questions:
3. Are the claims (1) and (2) accurate representations of inner misalignment and fragility of value?
4. Is the “misgeneralization” claim just “δ′ will be much larger than δ”?
If the answer to (4) is yes, I am confused as to why the misgeneralization claim is brought up. It seems that (1) and (2) are sufficient to argue for AI risk. By contrast, it seems that the misgeneralization claim is neither sufficient nor necessary to make a case for AI risk. Furthermore, the misgeneralization claim seems less likely to be true than (1) and (2).
Also let me know if I am thinking about things in a completely wrong framework and should scrap my made up notation.
There’s probably a better name for this. Please suggest one!
Non-exhaustive list.
I like your made up notation. I’ll try to answer, but I’m an amateur in both reasoning-about-this-stuff and representing-others’-reasoning-about-this-stuff.
I think (1) is both inner and outer misalignment. (2) is fragility of value, yes.
I think the “generalization step is hard” point is roughly “you can get δ low by trial and error. The technique you found at the end that gets δ low—it better not intrinsically depend on the trial and error process, because you don’t get to do trial and error on δ′. Moreover, it better actually work on M′.”
Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That’s one of their many problems.
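A cartoon of the difference between the two regimes (the noise model and numbers are invented; the only point is that one regime has a retry loop and the other doesn’t):

```python
import random

random.seed(1)

TOLERABLE = 0.1

def attempt_alignment(effort):
    """Cartoon of one alignment attempt: more accumulated effort tends to
    give a smaller delta, with noise."""
    return max(0.0, random.gauss(1.0 - effort, 0.2))

# Regime 1 (model M): a bad attempt is survivable, so trial and error works.
effort, attempts = 0.2, 0
while True:
    attempts += 1
    delta = attempt_alignment(effort)
    if delta < TOLERABLE:
        break
    effort += 0.05  # patch, retrain, adjust, try again
print(f"M : delta = {delta:.2f} after {attempts} attempts")

# Regime 2 (model M'): the retry loop is not available.
# Whatever the technique produces on the first attempt is what we live with.
delta_prime = attempt_alignment(effort=0.2)  # same starting technique, one draw
print(f"M': delta' = {delta_prime:.2f} on the only attempt we get")
```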
My suggested term for standard MIRI thought would just be Mirism.
I kinda don’t like “generalization” as a name for this step. Maybe “extension”? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty in getting δ to be small, the difficulty of going from techniques for getting δ small to techniques for getting a small δ′ (verbiage different because of the first-time constraint), the disastrousness of even smallish δ′…