Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following[2]:
1. Because of the way AI systems are trained, δ and δ′ will be large even if we knew humanity’s collective utility function and could target that (this is inner misalignment)
2. Even if δ′ were fairly small, this would still result in catastrophic outcomes if M′ is an extremely powerful optimizer (this is fragility of value)
A few questions:
3. Are the claims (1) and (2) accurate representations of inner misalignment and fragility of value?
4. Is the “misgeneralization” claim just “δ′ will be much larger than δ”?
If the answer to (4) is yes, I am confused as to why the misgeneralization claim is brought up. It seems that (1) and (2) are sufficient to argue for AI risk. By contrast, it seems that the misgeneralization claim is neither sufficient nor necessary to make a case for AI risk. Furthermore, the misgeneralization claim seems less likely to be true than (1) and (2).
Also let me know if I am thinking about things in a completely wrong framework and should scrap my made-up notation.
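For concreteness, here is roughly the picture I have in mind when I write δ and δ′ (a sketch of my made-up notation, not a precise definition; the divergence d and the hatted objectives below are placeholders):

```latex
% Sketch only, on my reading of the notation; d, \hat{U}, \hat{U}' are placeholders.
% U        : humanity's collective utility function
% \hat{U}  : the objective the trained system M actually ends up pursuing
% \hat{U}' : the objective pursued by the more capable successor M'
\[
  \delta  = d\bigl(\hat{U},\, U\bigr), \qquad
  \delta' = d\bigl(\hat{U}',\, U\bigr)
\]
% Claim (1): training makes \delta and \delta' large even when U is the explicit target.
% Claim (2): even for small \delta', an action a^* \in \arg\max_a \hat{U}'(a) chosen by a
%            sufficiently powerful optimizer can still have U(a^*) \ll \max_a U(a).
```

On that reading, claim (2) says a small δ′ stops being reassuring once M′ optimizes its own objective hard enough.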
There’s probably a better name for this. Please suggest one!
Non-exhaustive list.
I like your made-up notation. I’ll try to answer, but I’m an amateur in both reasoning-about-this-stuff and representing-others’-reasoning-about-this-stuff.
I think (1) is both inner and outer misalignment. (2) is fragility of value, yes.
I think the “generalization step is hard” point is roughly: “you can get δ low by trial and error. The technique you found at the end that gets δ low had better not intrinsically depend on the trial-and-error process, because you don’t get to do trial and error on δ′. Moreover, it had better actually work on M′.”
Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That’s one of their many problems.
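A toy sketch of that asymmetry (the function names and callables here are hypothetical placeholders, not anyone’s actual pipeline): there is an iterative feedback loop for δ, but only a single shot for δ′.

```python
# Toy illustration of the trial-and-error asymmetry described above.
# Everything here is a hypothetical placeholder, not an actual alignment pipeline.

def drive_delta_down(model, measure_delta, patch, budget=100, tol=1e-2):
    """Trial and error on delta: measure misalignment, patch, repeat."""
    for _ in range(budget):
        delta = measure_delta(model)
        if delta < tol:
            break
        model = patch(model, delta)  # feedback from observed failures
    return model

def evaluate_successor(model, scale_up, measure_delta_prime):
    """No trial and error on delta-prime: the successor M' is built and
    evaluated once; the patching loop above is not available here."""
    successor = scale_up(model)
    return measure_delta_prime(successor)
```

If drive_delta_down only achieves a low δ because of the loop itself, nothing in this picture says the result carries over to the one-shot evaluate_successor step, which is the worry as I read it.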
My suggested term for standard MIRI thought would just be Mirism.
I kinda don’t like “generalization” as a name for this step. Maybe “extension”? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty in getting δ to be small, the difficulty of going from techniques for getting δ small to techniques for getting a small δ′ (verbiage different because of the first-time constraint), the disastrousness of even smallish δ′…